1 Abstract

Algorithms to study structural variants (SV) in whole genome sequencing
(WGS) cancer datasets are currently unable to sample the entire space of
rearrangements while allowing for copy number variations (CNV). In addition,
rearrangement theory has up to now focused on fully assembled genomes,
not on fragmentary observations on mixed genome populations. This affects
the applicability of current methods to actual cancer datasets, which are
produced from short read sequencing of a heterogeneous population of cells. We
show how basic linear algebra can be used to describe and sample the set of
possible sequences of SVs, extending the double cut and join (DCJ) model
into the analysis of metagenomes. We also describe a functional pipeline
which was run on simulated as well as experimental cancer datasets.

2 Introduction 

Current cancer genomic surveys [McLendon et al., 2008, Hudson et al., 2010] 
are collecting tremendous amounts of paired genomic datasets. Healthy and 
tumor tissues are taken from a cancer patient and sequenced independently. 
In theory, this allows analysts to distinguish germline variants, which are
often unrelated to the tumor, from the somatic ones, which are more likely
drivers of the disease. This approach has produced promising results with
single nucleotide variants (SNV) [Ley et al., 2008] and copy-number variants
(CNV) [Shlien and Malkin, 2009]. However, breakpoints, which connect two
previously distant points of the genome (breakends), are much harder to 
study because of their combinatorial nature. For example, a gene can be fused 
to another gene [Tomlins et al., 2005]. A first step in examining breakpoint 
data consists in explaining the given data with an evolutionary history, thus 
providing insight on the order of events [Hannenhalli and Pevzner, 1999] , or 
re-evaluating the plausibility of the data [McPherson et al., 2012]. 

Theoretical models of balanced rearrangements, i.e. without creation or 
deletion of DNA material, have been abundantly studied, since they leave 
a characteristic signature [Bafna and Pevzner, 1993]: if there are no gains 
or losses of DNA, nor gain or loss of telomeric ends, then each segment end 
starts and finishes tied by a covalent bond to a partner. If we build a graph 
of breakends connected by edges to their partners before and after rear- 
rangement events occur, we necessarily obtain a set of cycles which alternate 
between created and deleted bonds. Such cycles have even been observed in 
prostate cancer genomes [Berger et al., 2011]. From this graph of cycles it
is possible to infer optimal histories of rearrangements in polynomial or
linear time, depending on the preferred rearrangement cost model [Hannenhalli
and Pevzner, 1999, Yancopoulos et al., 2005]. Yancopoulos et al. [2005]
introduced the popular Double Cut and Join (DCJ) model of rearrangements,
in which each 2-break operation consists of two pairs of connected breakends
being simultaneously disconnected then reconnected (in one of two possible
ways), thus forming a new genome. Following this idea, McPherson
et al. [2012] described a pipeline to study arbitrarily complex patterns of 
rearrangements. However, exploring the space of non-optimal yet possible 
rearrangement histories is still mathematically complex [Durrett, 2005]. 

In contrast, studying the complete set of rearrangements while allowing 
for gains and losses of DNA copies is much more difficult. Yancopoulos 
and Friedberg [2009] proposed an extension of the DCJ model to evolution 
with insertions and deletions, restricting insertions and deletions to fixed 
regions. Bader [2009] generalized this model, constraining duplications to
contiguous regions of the original genome, and providing a lower bound on
the number of necessary operations. Braga et al. [2011] also provided a 
distance measure for rearrangements, insertions and deletions, but without 
allowing for duplicated regions. Shao and Lin [2012] extended this to a 
general model with full insertions, duplications and deletions of single atomic 
genes, but without reaching a closed formula of the distance separating two 
genomes. In practice, Greenman et al. [2012] presented a pipeline to study 
cancer datasets, searching for known rearrangement patterns. Although it 
allows for unbalanced rearrangements, it is limited to a small catalogue of 
simple rearrangements, thus preventing the study of complex rearrangement 
patterns, such as chromothripsis [Stephens et al., 2011]. 

To add to the complexity of the task, cancer tissues are increasingly un- 
derstood as a diverse environment of clonal colonies competing with one 
another [Merlo et al., 2006]. This heterogeneity and the dynamics of can- 
cer genomes are crucial to explaining clinical events, such as tumor relapse 
[Walter et al., 2012]. In addition, despite careful efforts, it is common for 
healthy tissue to be sequenced within the tumor sample, an artifact known 
as stromal contamination [Carter et al., 2012]. It is therefore necessary to 
approach a tumor sample as a metagenome, i.e. a mixture of genomes ob- 
tained from different sources, for a method to be robust on experimental 
data. Unfortunately, no existing model, to our knowledge, can handle such 
diversity. For this reason, instead of associating integer copy-numbers to 
segments [Medvedev et al., 2009], we propose the use of a continuous flow 
parameter, defined as the mean copy-number count across a population of 
cells. From sequencing data, this flow is directly estimated by means of the
observed number of sampled sequences which come from a given genomic
region, the read coverage.

We formalize here a model that extends the simplicity of DCJ cycles to
insertions and deletions, extending the ones described in [Yancopoulos and 
Friedberg, 2009, Bader, 2009, Braga et al., 2011, Shao and Lin, 2012]. We 
show how, similarly to the Venetian method of double-entry bookkeeping, 
this model ensures that all quantifications are consistent, even in arbitrarily 
complex mixture models. Allowing for copy-number multiplicity introduces 
degeneracy of the observed variables, and we verify that this extension of 
an otherwise tractable problem is NP-hard. However, we also demonstrate 
that this algebraic representation allows us to decompose any input dataset 
into irreducible elements, thus providing a sound framework for the sampling 
of this combinatorial space. We finally propose a heuristic approach to the 
practical problem of studying the history of a metagenome, taking advantage 
of the fundamental notion of linear independence. This pipeline was applied 
to cancer sequencing data from the TCGA project [McLendon et al., 2008]. 

3 Results 

3.1 Representing static genomes 

3.1.1 Sequence graphs 

The following graphs are the setting that supports the rest of the analysis. 
They are used to describe sets of homologous genomes that are differentiated 
by rearrangements and copy number variations. They are also very similar to 
the graph structures such as de Bruijn graphs used to represent and assemble 
sequencing data [Medvedev et al., 2007].

Definition 1. A sequence graph $G(V, E)$ is an undirected multi-graph where
a set $V$ of vertices are connected by a set $E$ of edges. The edges of $E$ are
divided into segment and bond edges. There can be more than one segment
edge between a pair of nodes $(x, y)$ (hence the graph can be a multi-graph),
but there must be at most one bond edge between each pair of nodes, and at
least one segment edge incident on any node. In addition, for two segment
edges $(x, y)$ and $(x', y')$, if $x = x'$ then $y = y'$. They are then said to be
homologous. There is no a priori constraint on the length or content of the
labels of homologous segments. Although undirected themselves, segment
edges have directed labels: if traversed in one direction they are equivalent
to a given DNA sequence, whereas traversed in the opposite direction they
yield the reverse complement of that sequence. To avoid ambiguity in
interpretation, if the two ends of a segment edge are the same vertex, then the
edge label must be a DNA sequence that is identical to its reverse
complement, e.g. the empty sequence or a string of unknown bases (usually
denoted by Ns). A weighting function assigns a real-valued weight to each
edge, such that edges with weight 0 can be ignored.

See Figure 1.b for an example sequence graph.

In all of the following, we assume an arbitrary fixed sequence graph
$G(V, E)$. To describe sequence variation, we define threads on the sequence
graph:

Definition 2. A cycle traversal is a closed walk through the graph, possibly 
with edge re-use, with an explicit starting point and directionality. A cycle 
is an equivalence class of cycle traversals which cover the same edges with 
the same multiplicities. 

Definition 3. A sequence traversal is a cycle traversal which alternates be- 
tween bond and segment edges, starting on a segment edge. Its thread is 
the associated cycle. A genome is a multiset of one or more threads (i.e. 
with threads possibly represented multiple times in the genome). A simple
genome is illustrated in Figure 1.c.

Every sequence traversal of a thread is associated with a unique DNA 
sequence, produced by the concatenation of the segment edge labels in the 
direction they are traversed. The reverse traversal maps to the reverse com- 
plement sequence. 

3.1.2 Flows 

Flows are a family of real functions on the edges $E$ of the sequence graph
$G(V, E)$ with a group structure.

Definition 4. The edge space $\mathbb{R}^E$ denotes the set of real-valued
weightings on the set of edges $E$.

$\mathbb{R}^E$ is a vector space isomorphic to $\mathbb{R}^{|E|}$.

Definition 5. A flow on $G$ is a weighting in $\mathbb{R}^E$ such that at each vertex
the total weight of segment edge incidences equals the total weight of bond
edge incidences. We call this the balance condition. $\mathcal{F}(G)$ denotes the
set of all flows on $G$.

It is essential in the above to distinguish between edges and edge
incidences. In particular, if a bond edge connects a vertex $v$ to itself, its
weight is contributed twice to $v$.

Because the balance condition is conserved by addition and multiplication
by a real scalar, $\mathcal{F}(G)$ is a subspace of $\mathbb{R}^E$.
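
The balance condition can be illustrated with a short Python sketch. The edge encoding (tuples of kind, two endpoints and a weight) and the name `is_flow` are assumptions made for this example only; they do not come from the text. The sketch counts one contribution per edge incidence, so a self-loop contributes twice to its vertex.

```python
from collections import defaultdict

# Hypothetical encoding: each edge is (kind, u, v, weight),
# where kind is "segment" or "bond".
def is_flow(edges, tol=1e-9):
    """Return True if the weighting satisfies the balance condition."""
    imbalance = defaultdict(float)
    for kind, u, v, weight in edges:
        sign = 1.0 if kind == "segment" else -1.0
        # One contribution per edge incidence: a self-loop (u == v)
        # therefore contributes its weight twice to the same vertex.
        imbalance[u] += sign * weight
        imbalance[v] += sign * weight
    return all(abs(x) <= tol for x in imbalance.values())

# A circular chromosome: one segment (a, b) closed by the bond (a, b).
circular = [("segment", "a", "b", 1.0), ("bond", "a", "b", 1.0)]

# A segment of weight 2 closed by two bond self-loops of weight 1.
# Each loop is incident twice on its vertex, so both vertices balance.
hairpins = [("segment", "a", "b", 2.0),
            ("bond", "a", "a", 1.0),
            ("bond", "b", "b", 1.0)]

print(is_flow(circular), is_flow(hairpins))
```

A lone segment edge with no bond, such as `[("segment", "a", "b", 1.0)]`, fails the check because neither endpoint is balanced.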

Definition 6. $\mathcal{F}_{\mathbb{Z}}(G)$ denotes the set of integer-valued flows on $G$.

$\mathcal{F}_{\mathbb{Z}}(G)$ is a $\mathbb{Z}$-modular lattice of the vector space $\mathcal{F}(G)$. We will
measure elements of $\mathcal{F}_{\mathbb{Z}}(G)$ with the $L_1$ norm, defined by
$\|f\|_1 = \sum_{e \in E} |f(e)|$.

Definition 7. An integer flow $f \in \mathcal{F}_{\mathbb{Z}}(G)$ is minimal if there do not exist
two non-zero integer flows $(f_1, f_2)$ such that $f = f_1 + f_2$ and
$\|f\|_1 = \|f_1\|_1 + \|f_2\|_1$. We denote the set of minimal flows in $G$ by $\mathcal{M}(G)$.

$\mathcal{M}(G)$ is clearly a spanning set of $\mathcal{F}_{\mathbb{Z}}(G)$, in the sense that all elements
of $\mathcal{F}_{\mathbb{Z}}(G)$ can be decomposed into a finite sum of minimal flows in $\mathcal{M}(G)$.
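
To make the minimality condition of Definition 7 concrete, the following Python sketch tests whether a given pair of weightings witnesses that a flow is not minimal. The dictionary encoding and the names `l1_norm`, `add_flows` and `splits_additively` are hypothetical. The sketch does not verify that its inputs satisfy the balance condition, and it only checks one candidate pair; it is not a search for a decomposition.

```python
def l1_norm(flow):
    """L1 norm of a weighting stored as {edge: integer weight}."""
    return sum(abs(w) for w in flow.values())

def add_flows(f1, f2):
    """Edge-wise sum, with zero entries dropped."""
    total = dict(f1)
    for edge, w in f2.items():
        total[edge] = total.get(edge, 0) + w
    return {edge: w for edge, w in total.items() if w != 0}

def splits_additively(f, f1, f2):
    """True if (f1, f2) shows that f is not minimal (Definition 7)."""
    non_zero = l1_norm(f1) > 0 and l1_norm(f2) > 0
    same_sum = add_flows(f1, f2) == {e: w for e, w in f.items() if w != 0}
    norms_add = l1_norm(f) == l1_norm(f1) + l1_norm(f2)
    return non_zero and same_sum and norms_add

# Two vertex-disjoint circular threads: their sum is not minimal.
f1 = {("segment", "a", "b"): 1, ("bond", "a", "b"): 1}
f2 = {("segment", "c", "d"): 1, ("bond", "c", "d"): 1}
print(splits_additively(add_flows(f1, f2), f1, f2))
```

Writing `f1` as `2 * f1 + (-f1)` is not a witness, because the norms of the two parts add up to 6 while the norm of `f1` is 2.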

Definition 8. $\mathcal{F}^+(G)$ and $\mathcal{F}^+_{\mathbb{Z}}(G)$ denote the sets of nonnegative and
nonnegative integer flows on $G$, respectively.

3.1.3 Genome flows 

We now demonstrate how each genome can be associated with a specific flow
function:

Definition 9. For a particular thread $t$, its thread flow $f(t)$ is the weighting
which assigns to each edge the number of times it is covered (see Figure
1.d). Because a thread $t$ is an alternating cycle of bonds and segments, this
weighting $f(t)$ satisfies the balance condition, and hence is a flow. Let $\mathcal{T}(G)$
denote the set of thread flows on $G$. The genomic flow $f(g)$ for a genome $g$
is the sum of thread flows $f(t)$ for the threads in $g$. Let $\mathcal{F}_{\mathbb{N}}(G)$ denote the
set of genome flows for all genomes on $G$.

Lemma 1. $\mathcal{F}_{\mathbb{N}}(G) = \mathcal{F}^+_{\mathbb{Z}}(G)$.

Proof. The proofs of all the lemmas and theorems of this manuscript are in
the Supplemental Materials. □
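
As a hedged illustration of Definition 9, the sketch below computes the thread flow of a circular chromosome A B B C (a tandem duplication of block B, in the spirit of Figure 1) by counting how often each edge is traversed. The vertex names, the tuple encoding and the helper names are assumptions of this example.

```python
from collections import Counter

def edge_key(kind, u, v):
    # Edges are undirected: store the endpoints in sorted order.
    return (kind, *sorted((u, v)))

def thread_flow(traversal):
    """Number of times each edge is covered by the traversal."""
    return Counter(edge_key(*step) for step in traversal)

def is_balanced(flow):
    """Segment and bond incidences must cancel at every vertex."""
    balance = Counter()
    for (kind, u, v), w in flow.items():
        sign = 1 if kind == "segment" else -1
        balance[u] += sign * w
        balance[v] += sign * w
    return all(x == 0 for x in balance.values())

# Alternating segment/bond traversal of the circular genome A B B C.
tandem = [
    ("segment", "a1", "a2"), ("bond", "a2", "b1"),
    ("segment", "b1", "b2"), ("bond", "b2", "b1"),
    ("segment", "b1", "b2"), ("bond", "b2", "c1"),
    ("segment", "c1", "c2"), ("bond", "c2", "a1"),
]
flow = thread_flow(tandem)
print(flow[("segment", "b1", "b2")], is_balanced(flow))
```

The duplicated segment B receives weight 2, and its two endpoints are each balanced by two distinct bonds of weight 1.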
3.1.4 The conjugate balance condition and alternating flows

We now define a particular and crucial set of integer flows used in our
constructions. A few technical definitions are necessary:

Definition 10. For any weighting $f$ in $\mathbb{R}^E$, its conjugate weighting $\hat{f}$
assigns $f(e)$ to any segment edge $e$, and $-f(e)$ to any bond edge $e$.

See Figure 1.e for an example weighting and its conjugate.
It is clear that the conjugate transformation is symmetric ($\hat{\hat{f}} = f$) and
linear: $\forall (f_1, f_2) \in (\mathbb{R}^E)^2, \forall (p, q) \in \mathbb{R}^2, \widehat{p f_1 + q f_2} = p \hat{f}_1 + q \hat{f}_2$.

Definition 11. The condition that the sum of incident weights for every
vertex is 0 is called the conjugate balance condition.

The balance condition can be reformulated in terms of a conjugate
weighting: a weighting is a flow iff for every vertex the sum of incident
conjugate weights is 0. In other words, $f$ is a flow iff $\hat{f}$ satisfies the
conjugate balance condition.
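
The relation between a flow and its conjugate can be sketched in a few lines of Python. The dictionary encoding and function names are illustrative assumptions; the sketch only checks the two properties stated above on one small example.

```python
def conjugate(weighting):
    """Keep segment weights, negate bond weights."""
    return {edge: (w if edge[0] == "segment" else -w)
            for edge, w in weighting.items()}

def satisfies_conjugate_balance(weighting):
    """The incident weights sum to 0 at every vertex."""
    total = {}
    for (kind, u, v), w in weighting.items():
        # One contribution per incidence, so self-loops count twice.
        for vertex in (u, v):
            total[vertex] = total.get(vertex, 0) + w
    return all(x == 0 for x in total.values())

# Flow of a circular chromosome with a single segment.
f = {("segment", "a", "b"): 1, ("bond", "a", "b"): 1}
f_hat = conjugate(f)

print(f_hat, satisfies_conjugate_balance(f_hat))
```

Applying `conjugate` twice returns the original weighting, and the flow `f` itself does not satisfy the conjugate balance condition, since both of its weights are positive.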

Definition 12. For each edge $e$, let $\delta_e$ be the weighting that assigns 1 to $e$,
and 0 to all other edges.

The set $\{\delta_e\}_{e \in E}$ forms a trivial spanning set of $\mathbb{Z}^E$.

Definition 13. Given the cycle traversal $t = \{e_j\}_{j=1..2n} \in E^{2n}$ of an even
length cycle $c$ (possibly with edge re-use, possibly with 2 or more
consecutive bond or segment edges), its associated alternating weighting is
$w(t) = \sum_{j=1}^{2n} (-1)^j \delta_{e_j}$.

Because the weights of the edges alternate between 1 and -1 along an
even length cycle, an alternating weighting satisfies the conjugate balance
condition, see Figure 2 b).

Definition 14. An alternating flow is the conjugate of an alternating
weighting. Let $\mathcal{A}(G)$ denote the set of alternating flows of $G$.

Because an alternating flow is conjugate to a weighting that satisfies the
conjugate balance condition, it satisfies the (normal) balance condition, and
hence is a flow. See Figure 2 c).
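
The construction of an alternating flow can be sketched as follows. This example assumes the sign convention in which the j-th edge of the traversal (counting from 1) receives the weight (-1)^j; that convention, the edge encoding and the function names are assumptions of the sketch. Shifting the starting point by one edge flips every sign, so the set of alternating flows does not depend on this choice.

```python
def alternating_weighting(traversal):
    """Sum of (-1)**j over the positions j (from 1) of each edge."""
    weighting = {}
    for j, edge in enumerate(traversal, start=1):
        weighting[edge] = weighting.get(edge, 0) + (-1) ** j
    return weighting

def conjugate(weighting):
    """Keep segment weights, negate bond weights."""
    return {e: (w if e[0] == "segment" else -w)
            for e, w in weighting.items()}

def is_balanced(weighting):
    """Balance condition: segment and bond incidences cancel."""
    balance = {}
    for (kind, u, v), w in weighting.items():
        sign = 1 if kind == "segment" else -1
        for vertex in (u, v):
            balance[vertex] = balance.get(vertex, 0) + sign * w
    return all(x == 0 for x in balance.values())

# An even length cycle a-b-c-d-a alternating segments and bonds.
cycle = [("segment", "a", "b"), ("bond", "b", "c"),
         ("segment", "c", "d"), ("bond", "a", "d")]
w = alternating_weighting(cycle)
flow = conjugate(w)
print(w, flow, is_balanced(flow))
```

Here the alternating weighting takes the values -1, +1, -1, +1 along the cycle, and its conjugate assigns -1 to every edge, which satisfies the balance condition while the alternating weighting itself does not.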

For the next two definitions, let $t = \{e_j\}_{j=1..2n} \in E^{2n}$ be an even length
cycle traversal on $G$.

Figure 1: (a) A block structure representation of a genome consisting of a
circular chromosome of three sequence blocks with a tandem duplication of
block B. (b) A sequence graph $G$ with 3 segment edges (thick curved lines)
corresponding to the homology blocks in (a) and all the possible
interconnecting bonds (thin straight lines), suitable for representing the
genome. (c) The genome $g$ corresponding to the block sequence of (a)
represented as a single thread on $G$. (d) The associated genome flow $f(g)$.
Edges with positive weighting are indicated in red with numeric values nearby,
edges with null weightings are left in grey. (e) The conjugate weighting of the
genome flow $f(g)$. The color scheme is the same, but blue edges indicate
negative values. Note how the sign of the conjugate of the thread flow
alternates between positive and negative values.

Figure 2: (a) An even length cycle traversal, (b) its alternating weighting, 
and (c) the conjugate of the alternating weighting (i.e. the alternating flow 
of the cycle traversal). Note how the alternating flow respects the balance 
condition and the alternating weighting the conjugate balance condition. 



Definition 15. Two distinct indices $i$ and $j$ are said to coincide if $e_i$ and
$e_j$ end on the same vertex of $G$. The traversal $t$ is synchronized if whenever
two distinct indices $i$ and $j$ coincide, then $(i - j)$ is odd.

See Figure 3 for an example. Note that a synchronized traversal cannot 
have more than two edges that end on the same node because for any three 
integers at least two are even or two are odd, and hence at least one of the 
pairwise differences among any three integers is even. Hence, a synchronized 
traversal visits each node at most twice. 
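
The synchronization test can be sketched directly from the definition. In this hypothetical encoding a traversal is given by the list of vertices on which its successive edges end, and indices are reported starting from 1; the sketch does not check that the list describes a valid closed walk.

```python
from itertools import combinations

def coinciding_pairs(end_vertices):
    """Pairs (i, j), i < j, of 1-based edge indices ending on one vertex.

    end_vertices[k] is the vertex on which edge number k + 1 ends.
    """
    return [(i + 1, j + 1)
            for i, j in combinations(range(len(end_vertices)), 2)
            if end_vertices[i] == end_vertices[j]]

def is_synchronized(end_vertices):
    """Every pair of coinciding indices must have an odd difference."""
    return all((j - i) % 2 == 1
               for i, j in coinciding_pairs(end_vertices))

# Edges 3 and 7 end on the same vertex x, and 7 - 3 is even.
not_sync = ["a", "b", "x", "c", "d", "e", "x", "f"]
# Edges 3 and 6 end on the same vertex x, and 6 - 3 is odd.
sync = ["a", "b", "x", "c", "d", "x"]

print(coinciding_pairs(not_sync), is_synchronized(not_sync))
print(coinciding_pairs(sync), is_synchronized(sync))
```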




Figure 3: On the left, a non-synchronized cycle traversal: edges 3 and 7
coincide, although (3-7) is even. A simple operation transforms it into two
synchronized cycles, such that the sum of alternating flows is unchanged.

Definition 16. A set of pairs of integers $P$ is nested if there does not exist
$(i_1, j_1), (i_2, j_2) \in P$ such that $i_1 < i_2 < j_1 < j_2$. The traversal $t$ is nested if
the set of pairs of coinciding indices of $t$ is nested.

See Figure 4 for an example. 
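
The nesting condition on pairs of coinciding indices is a simple crossing test, sketched below in Python. The function name `is_nested` is an assumption; each pair is sorted first so that it can be given in either order.

```python
def is_nested(pairs):
    """True if no two pairs interleave as i1 < i2 < j1 < j2."""
    normalized = [tuple(sorted(p)) for p in pairs]
    return not any(i1 < i2 < j1 < j2
                   for (i1, j1) in normalized
                   for (i2, j2) in normalized)

# The pairs (2, 7) and (5, 8) interleave: 2 < 5 < 7 < 8.
print(is_nested([(2, 7), (5, 8)]))
# The pair (3, 6) sits entirely inside (2, 7).
print(is_nested([(2, 7), (3, 6)]))
```

Disjoint pairs such as (1, 2) and (3, 4) also count as nested, because they do not interleave.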




Figure 4: On the left, a non-nested cycle traversal: coinciding index pairs
(2, 7) and (5, 8) are such that 2<5<7<8. A simple operation transforms 
it into two nested cycles such that the sum of alternating flows is unchanged. 



Definition 17. The alternating flow of a synchronized and nested even
length cycle traversal is simple. We denote by $\mathcal{S}(G)$ the set of all simple
flows on $G$.

Because it contains no pair of coinciding indices, an alternating flow ob- 
tained from an even length traversal that does not visit any vertex twice is 
trivially a simple flow. Figure 2.c shows an example of a nontrivial simple 
flow. 

Lemma 2. A flow $f$ on $G$ is simple if and only if the graph of its unitized
flow $G(f)$ is a single-component even-odd cactus graph.

Theorem 1. $\mathcal{S}(G) = \mathcal{M}(G)$

On the one hand, this short theorem tells us that to determine whether 
a flow is minimal, it is only necessary to verify that it is the alternating flow 
to a synchronized and nested cycle, which clearly can be done in polynomial 
time (see 3.8 for more details). 

Conversely, this theorem also tells us that we can decompose the
difference between any two genome flows into a sum of simple flows. As we will
see in the following sections, simple flows are intimately linked to structural
rearrangements. In particular, the problem of decomposing a balanced flow 
into minimal flows bears much resemblance to the problem of covering a bi- 
colored Eulerian graph with a maximal set of alternating cycles, as described 
by Caprara [1999]. In both cases, determining whether a cycle is minimal is 
trivial, but finding an optimal decomposition of a given problem, whether a 
flow or a colored graph, is NP-hard. 

3.2 Representing genome evolution 

3.2.1 Phased sequence graphs 

A genome $g$ has only one flow $f(g)$, but one flow $f$ can represent many
different genomes [van Aardenne-Ehrenfest and de Bruijn, 1951]. To resolve
this degeneracy, we use an extension of $G$ that avoids ambiguity:

Definition 18. Given a genome $g$ on sequence graph $G$, its associated phased
sequence graph $G_g$ is the sequence graph obtained by creating novel derived
vertex-disjoint segment edges for the individual copies of the segments
repeated in the thread multiset $g$, and deleting non-traversed segment edges.
The graph of bond edges is made complete, i.e. every node is connected by a
bond edge to every other, including itself. The predecessor mapping $\Psi$ maps
from the segment edges of $G_g$ to the segment edges of $G$ from which they
came (Figure 5c). The mapping $\Psi$ is naturally extended so that any genome
$g'$ on $G_g$ is mapped to a genome $\Psi(g')$ on $G$, and any flow $f'$ on $G_g$ is mapped
to a flow $\Psi(f')$ on $G$. In particular, there is a genome $g'$ on $G_g$ that visits
every segment in $G_g$ exactly once and maps to the genome $\Psi(g') = g$ on $G$.
Such a genome $g'$ is called a phased version of $g$.

3.2.2 Evolutionary genome sequences 

Evolutionary processes create a series of changes to a genome involving du- 
plications, losses and rearrangements of DNA. 

Definition 19. An evolutionary genome sequence is a sequence $S = (g_1, \ldots, g_n)$
of genomes where $g_1$ is defined as a thread multiset on $G$, $g_2$ is defined as
a thread multiset on $G_{g_1}$, $g_3$ is defined as a thread multiset on $(G_{g_1})_{g_2}$, etc.
(Figure 7). To avoid deeply nested subscripts, we denote by $G_i$ the graph
resulting from iterative phasing of $G$ by $g_1, \ldots, g_i$. Each pair $(g_i, g_{i+1})$ is
called a genome transition. For $i > 1$, $\Psi_i$ is the mapping from $G_i$ to its
predecessor $G_{i-1}$ associated with the transition $(g_{i-1}, g_i)$, $\Psi_1$ being the
mapping of $G_1$ onto $G$. The flow change $\delta f_i$ of the transition $(g_i, g_{i+1})$ is the
difference between the flows of the genomes in the transition projected onto $G_i$:

$\delta f_i = f(\Psi_{i+1}(g_{i+1})) - f(g_i)$

The composition $\Psi_1 \circ \ldots \circ \Psi_i$ is denoted $\bar{\Psi}_i$. The overall flow change of $S$,
$\Delta f$, is the sum of these flow changes projected onto $G$:

$\Delta f = \sum_{i=1}^{n-1} \bar{\Psi}_i(\delta f_i)$

Figure 5: (a) A genome $g$ composed of three blocks with a tandem duplication,
(b) the genome as a thread (multiset) on $G$ and its flow, (c) the genome and
its (phased) flow on its phased sequence graph $G_g$. The mapping $\Psi$ from $G_g$
to $G$ has $\Psi(B_1) = \Psi(B_2) = B$, and $\Psi$ maps segments A and C in $G_g$ 1-1 to
their namesakes in $G$.

3.2.3 Cost of genome sequences 

The most basic duplication, loss and rearrangement operations on genomes 
are as follows. 

Definition 20. Let $g$ be a genome, and $s$ and $s'$ be threads in $g$. Let
$t = (v_1, \ldots, v_n) \in V^n$ and $t' = (v'_1, \ldots, v'_{n'}) \in V^{n'}$ be sequence traversals of
respectively $s$ and $s'$ described as sequences of traversed vertices.

Phased sequence graphs:    $G \leftarrow G_1 \leftarrow G_2 \leftarrow G_3 \ldots \leftarrow G_n$
Genomes and flow changes:  $g_1 \xrightarrow{\delta f_1} g_2 \xrightarrow{\delta f_2} g_3 \ldots \xrightarrow{\delta f_{n-1}} g_n$
Overall flow change:       $\Delta f$ (on $G$)

Figure 6: An evolutionary genome sequence



• A thread removal of $s$ in $g$ consists of removing a copy of $s$ from $g$.

• A complete thread duplication of $s$ in $g$ consists of adding a copy of $s$
to $g$.

• An n-fold tandem thread duplication of $s$ for integer $n \geq 2$ consists of
replacing $s$ in $g$ with the thread associated with the concatenation of
$n$ instances of the traversal $t$ for $s$, producing a thread that is $n$ times as
long and still circular.

• An inversion DCJ operation at index $2i$ of the traversal $t$ for $s$ in $g$
corresponds to replacing $s$ in $g$ with the thread associated with the sequence
traversal $(v_1, v_2, \ldots, v_{2i}, v_n, v_{n-1}, \ldots, v_{2i+1})$.

• A splitting DCJ operation at index $2i$ of the traversal $t$ for $s$ in $g$ consists
of replacing $s$ with the threads associated with the sequence traversals
$(v_1, v_2, \ldots, v_{2i})$ and $(v_{2i+1}, \ldots, v_n)$.

• A merging DCJ operation of the traversals $t$ for $s$ and $t'$ for $s'$ in $g$
corresponds to replacing $s$ and $s'$ in $g$ with the thread associated with the
concatenation of $t$ and $t'$.

These operations are called elementary genomic operations. 
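
The three DCJ operations can be sketched on vertex lists. This is a minimal illustration that assumes a traversal is stored as a Python list of vertices starting on a segment edge, so that positions (1, 2), (3, 4), ... are the two ends of successive segments; it takes the inversion to reverse the tail that follows the first 2i vertices and the split to cut at the same position. The function names are hypothetical.

```python
def inversion(t, i):
    """Keep the first 2i vertices and reverse the rest of the traversal."""
    k = 2 * i
    return t[:k] + t[k:][::-1]

def splitting(t, i):
    """Cut the traversal after its first 2i vertices into two threads."""
    k = 2 * i
    return t[:k], t[k:]

def merging(t, t_prime):
    """Concatenate two traversals into one thread."""
    return t + t_prime

# Circular genome A B C, each segment given by its two end vertices.
abc = ["a1", "a2", "b1", "b2", "c1", "c2"]

print(inversion(abc, 1))   # segments B and C are reversed and swapped
print(splitting(abc, 1))   # A on its own thread, B C on another
print(merging(["a1", "a2"], ["b1", "b2", "c1", "c2"]))
```

Because the cut always falls after an even number of vertices, segment edges are never broken; only bonds are cut and rejoined, as expected of a DCJ operation.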

The above definition of DCJ operations is simply a reformulation of the 
one presented by Yancopoulos et al. [2005]. Because only DCJ operations 
are counted in the rearrangement cost, segmental (as opposed to complete 
thread) deletions and tandem segmental duplications can be explained with 
a rearrangement cost of 1, in agreement with previous models [Shao and Lin,
2012].


Figure 7: Top to bottom, an example evolutionary genome sequence viewed 
in 4 different representations. From left to right, the successive states of the 
genome represented as blocks, their corresponding threads and flows on the 
successive phased sequence graphs defining an evolutionary genome sequence, 
their transition flow changes, and the conjugate weighting of the flow changes, 
which can be summarized as alternating cycles of positive and negative edges. 

Definition 21. An elementary evolutionary genome sequence is an
evolutionary genome sequence in which the pair of genomes in every transition
are identical or differ by an elementary genomic operation. Its cost is equal
to the number of DCJ operations it contains.

The evolutionary genome sequence in Figure 7 is an example of an elemen- 
tary evolutionary genome sequence. Because it involves two DCJ operations 
(the segmental deletion and the inversion) it has cost 2. 

Definition 22. An evolutionary genome sequence $S$ is a contraction of an
evolutionary genome sequence $S'$ if $S$ is obtained from $S'$ by omitting some
genomes in the sequence $S'$. In this case we say that $S'$ is an extension of
$S$. An extension $S'$ is elementary if it is an elementary evolutionary genome
sequence. The cost of an evolutionary genome sequence $S$, denoted $C_G(S)$,
is the minimum cost of $S'$ over all elementary extensions $S'$ of $S$.

We derive a useful lower bound on the cost of an evolutionary genome 
sequence in the next section that is tight for a special category of evolutionary 
genome sequences. 

3.2.4 Cost of a simple genome sequence 

Definition 23. An evolutionary genome sequence is simple if the flow change 
of every transition in it is simple. 

Theorem 2. An elementary evolutionary genome sequence is simple. 

Definition 24. A genome transition is a 1-step evolutionary genome se- 
quence. 

Definition 25. A valid flow change is a pair of nonnegative integer flows
$(f_1, f_2) \in \mathcal{F}^+_{\mathbb{Z}}(G)^2$ such that $f_1$ has value 1 for every segment edge.

Definition 26. Let $(f_1, f_2)$ be a valid flow change. A realization of $(f_1, f_2)$
is any elementary evolutionary genome sequence $S = (g_1, \ldots, g_n)$ such that
$f(g_1) = f_1$ and $\Delta f(S) = f_2 - f_1$. The flow cost of $(f_1, f_2)$, denoted $C_F(f_1, f_2)$,
is the minimum cost of any realization of $(f_1, f_2)$.

Lemma 3. Every valid flow change has a realization, hence $C_F(f_1, f_2)$ is
well-defined, and for any 1-step evolutionary genome sequence $(g_1, g_2)$:

$C_F(f(g_1), \Psi_2(f(g_2))) \leq C_G((g_1, g_2))$

Figure 8: Two 1-step evolutionary genome sequences (a and b) which have
the same overall flow change $f$ (c). The second history cannot be realized
in less than 3 DCJ operations, while the first one can be realized in 2 DCJ
operations. This means that in this example, the flow cost of the second
genome sequence (i.e. 2) is strictly lower than its cost (i.e. 3).



Figure 8 illustrates where this inequality is strict. 

Definition 27. Let $(f_1, f_2)$ be two genome flows on $G$. A bond edge in $G$
is ancestral if it has positive flow in $f_1$. A bond edge is created if it is not
ancestral (i.e. has flow 0 in $f_1$) but has positive flow in $f_2$. A bond edge is
deleted if it is ancestral (has positive flow in $f_1$) but has flow 0 in $f_2$. The
complexity of a flow transition is defined by

$C(f_1, f_2) = n - |\mathcal{C}| + L - O$

where:

$n$ is the number of ancestral bond edges,

$\mathcal{C}$ is the set of connected components in $G$ formed by ancestral and created
bond edges,

$L$ is equal to $\sum_{c \in \mathcal{C}} \lceil l_c / 2 \rceil$, where $l_c$ is equal to the maximum number of
edge-distinct non-alternating cycles of created and destroyed bonds in $c$,

$O$ is equal to $\sum_{v \in V} \max(a_v - 1, 0)$, where $V$ is the set of vertices of $G$ and
$a_v$ is the number of ancestral bond edges incident on vertex $v$.

Lemma 4. For any 1-step evolutionary genome sequence $S = (g_1, g_2)$:

$C(f(g_1), \Psi_2(f(g_2))) \leq C_F(f(g_1), \Psi_2(f(g_2))) \leq C_G(S)$

The use of an optimal cycle decomposition in the complexity formula is
similar to the decomposition used to formulate the distance in [Shao and
Lin, 2012]. Because of these cycle decompositions, evaluating either formula
is NP-hard [Caprara, 1999]. Simple flows are what allow us to make full use
of the complexity formula:

Theorem 3. For any simple 1-step evolutionary genome sequence $S = (g_1, g_2)$,
$C(f(g_1), \Psi_2(f(g_2)))$ can be computed in polynomial time and:

$C(f(g_1), \Psi_2(f(g_2))) = C_F(f(g_1), \Psi_2(f(g_2))) = C_G(S)$.

By extension, for any simple evolutionary genome sequence $S = (g_1, \ldots, g_n)$:

$\sum_{i=1}^{n-1} C(f(g_i), \Psi_{i+1}(f(g_{i+1}))) = C_G(S)$.

3.3 Representing metagenome evolution

3.3.1 Metagenomes

Definition 28. A metagenome is a multiset of genomes on $G$, $m = \{g_1, \ldots, g_k\}$.
The individual genomes in $m$ are called clones. A weighted metagenome $(m, w)$
is a metagenome in which each clone $g_i$ has a positive weight $w_i = w(g_i)$ such
that $\sum_{i=1}^{k} w_i = 1$. The metagenome flow associated with $(m, w)$ is the convex
combination of the flows of its clones:

$f(m, w) = \sum_{i=1}^{k} w(g_i) f(g_i)$.

We denote the set of all metagenome flows for weighted metagenomes on $G$
by $\bar{\mathcal{F}}(G)$.

Definition 29. For any non-negative flow $f \in \mathcal{F}^+(G)$, we say that the
metagenome $(m, w)$ realizes $f$ if $f(m, w) = f$.

Lemma 5. $\bar{\mathcal{F}}(G) = \mathcal{F}^+(G)$, i.e. every non-negative flow can be realized.
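
The metagenome flow of Definition 28 is a weighted average of clone flows, as the following Python sketch illustrates. The clone flows, the weights 0.75 and 0.25, and the function name `metagenome_flow` are invented for the example and are not data from this study.

```python
def metagenome_flow(clones, tol=1e-9):
    """Convex combination of clone flows.

    clones is a list of (weight, flow) pairs, where each flow is a
    dict {edge: copy count} and the positive weights sum to 1.
    """
    weights = [w for w, _ in clones]
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be positive and sum to 1")
    mixed = {}
    for w, flow in clones:
        for edge, count in flow.items():
            mixed[edge] = mixed.get(edge, 0.0) + w * count
    return mixed

A = ("segment", "a1", "a2")
B = ("segment", "b1", "b2")

# Hypothetical sample: 75% of cells carry a duplication of segment B,
# the remaining 25% have a single copy of each segment.
tumor = {A: 1, B: 2}
normal = {A: 1, B: 1}
mixture = metagenome_flow([(0.75, tumor), (0.25, normal)])
print(mixture[A], mixture[B])
```

Only segment edges are listed here for brevity; bond edges would be averaged in exactly the same way. The resulting non-integer value on B is the kind of continuous copy-number signal that read coverage is expected to estimate.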

In its biological interpretation, a metagenome represents the set of genomes 
from a community of cells, such as a bacterial community or the cells in a 
tumor tissue. Clones are sets of cells within the community that have iden- 
tical genomes because they share a common recent ancestor. The weights in 
a weighted metagenome measure the fraction of cells in the community from 
each clone. These weights change continuously as the cells divide and die, 
some clones expanding more successfully than others.

3.3.2 Evolutionary metagenome sequences

Definition 30. Given a metagenome $m$ on $G$, its phased sequence graph
$G_m$ is the union of the phased sequence graphs of its individual genomes.
A predecessor mapping $\Psi$ is built from the predecessor mapping of each of
these individual phased sequence graphs.

Definition 31. An evolutionary metagenome sequence is a sequence of
metagenomes $(m_1, \ldots, m_n)$, $n \in \mathbb{N}$, such that $m_1$ is defined on $G$, $m_2$ is
defined on $G_{m_1}$, etc. Analogously to evolutionary genome sequences, we define
a sequence of phased sequence graphs $(G_1, \ldots, G_n)$, a sequence of predecessor
mappings $(\Psi_1, \ldots, \Psi_n)$, metagenome transitions, transition flow changes
$(\delta f_1, \ldots, \delta f_{n-1})$, and the overall flow change $\Delta f$ (see Figure 9).

Note that given an evolutionary metagenome sequence $(m_1, \ldots, m_n)$,
a clone of $m_{i+1}$ can have threads on separate components of $G_{m_i}$. This is
equivalent to permitting lateral gene transfer. This possibility, which is very
relevant to metagenomes in general, will be ignored in the case of cancer
genomes, as such events are unlikely to happen between tumor cells.

Definition 32. In a metagenome transition, because the phased sequence 
graph of a metagenome is clearly divided into components, each clone of the 
derived metagenome is unambiguously assigned a set of ancestral clones in 
the ancestral metagenome, hence defining an acyclic clonal derivation graph. 
Each branch of this graph describes a 1-step genome history between an 
ancestral clone and a portion of a derived clone. 

3.3.3 Cost of a metagenome sequence 

Definition 33. We define the following elementary metagenomic operations 
on a metagenome m: 

1. clone loss is the removal from m of a clone, 

2. subclone creation is the addition to m of a new clone g identical to a 
pre-existing clone g' . 

3. clone mutation is the alteration of a genome of m by an elementary
genomic operation.

[Figure 9 diagram: $m_1 = \{(g_a, 1.0)\}$; $m_2 = \{(g_b, 0.7), (g_c, 0.3)\}$;
$m_3 = \{(g_d, 0.1), (g_e, 0.2), (g_f, 0.7)\}$; transitions labelled $\delta f_1$ and
$\delta f_2$, predecessor mapping $\Psi_2$, overall flow change $\Delta f$ (on $G$).]

Figure 9: Example of an evolutionary genome sequence of weighted 
metagenomes and its clonal derivation graph. Note that although the weights 
within a metagenome add to 1, the weighting of a clone is independent from 
the weighting of its ancestral clones. 



These three operations define a general birth and death process of the 
clonal diversity within a population of genomes. 

Definition 34. An elementary evolutionary metagenome sequence is an evo- 
lutionary sequence in which two consecutive metagenomes are identical or 
differ by an elementary metagenomic operation. Its cost is the number of 
DCJ operations it contains. 

Note that weightings are not considered in cost, meaning that at every 
transition, an arbitrary reweighting of the clones can occur with no cost. 

Definition 35. Given an evolutionary metagenome sequence $S = (m_1, \ldots, m_n)$,
and a subsequence $(1, i_1, \ldots, i_k, n)$ of $(1, \ldots, n)$, we create a contraction $S'$ of
$S$ consisting of the metagenome sequence $(m_1, m_{i_1}, \ldots, m_{i_k}, m_n)$ and the
predecessor mapping sequence
$(\phi_1, (\phi_2 \circ \ldots \circ \phi_{i_1}), \ldots, (\phi_{i_k+1} \circ \ldots \circ \phi_n))$.
$S$ is then said to be a realization of $S'$.

Definition 36. The cost $C_g(S)$ of an evolutionary metagenome sequence $S$ is
the minimum cost of any of its realizations.

3.3.4 Cost of simple metagenome sequences

Definition 37. An evolutionary metagenome sequence is simple if the flow 
change of every 1-step genome transition along every branch of the clone 
derivation graph is simple or null. 

Theorem 4. An elementary evolutionary metagenome sequence is simple. 

Theorem 3 generalizes directly to evolutionary metagenome sequences: 

Theorem 5. For any simple evolutionary metagenome sequence $S$, $C_g(S)$
is equal to the sum of the costs of the 1-step simple genome histories at each
branch of the clone derivation graph.

3.4 Fitting the model to experimental data 

We now show how the above explicit model of metagenome evolution can
be fitted to the data provided by contemporary methods for whole genome
sequencing.

3.4.1 Flow sequences 

Definition 38. A flow sequence on G is a sequence of metagenomic flows of
G such that if one of these flows assigns 0 to a given segment edge, then all
subsequent flows assign 0 to that edge.

Having projected all the flow changes onto G, we assign a real-valued
weight to each flow change equal to the sum of weights of its derived clones in
the final metagenome. This allows us to embed a problem within $\mathbb{Z}^E$ into one
within $\mathbb{R}^E$: we are looking for a set of simple flow changes
$\{\delta f_1, \ldots, \delta f_n\} \subset \mathcal{S}(G)$ and a set of real weights
$\{w_1, \ldots, w_n\} \subset [0, 1]$ such that:

$$\Delta f = \sum_{i=1}^{n} w_i \, \delta f_i$$
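
As a rough illustration of this weighted decomposition, the sketch below rebuilds a net flow change from a list of unit flow changes and real weights, and measures how far a candidate decomposition is from the observed change. The dictionary representation, the edge names and the two example flow changes (a tandem duplication and a deletion) are assumptions made for this example only; they are not checked for simplicity in the sense of the text.

```python
def combine(weights, flow_changes):
    """Return sum_i w_i * delta_f_i, with flows stored as {edge: value} dicts."""
    total = {}
    for w, delta_f in zip(weights, flow_changes):
        for edge, value in delta_f.items():
            total[edge] = total.get(edge, 0.0) + w * value
    return total

def residual(target, weights, flow_changes):
    """L1 distance between the observed net change and the weighted sum."""
    combo = combine(weights, flow_changes)
    edges = set(target) | set(combo)
    return sum(abs(target.get(e, 0.0) - combo.get(e, 0.0)) for e in edges)

# Tandem duplication of segment s: one extra segment copy and one new bond.
duplication = {"seg_s": 1, "bond_dup": 1}
# Deletion of segment t: the segment and its two flanking bonds go, a joining bond appears.
deletion = {"seg_t": -1, "bond_left": -1, "bond_right": -1, "bond_join": 1}

observed = {"seg_s": 0.3, "bond_dup": 0.3, "seg_t": -0.5,
            "bond_left": -0.5, "bond_right": -0.5, "bond_join": 0.5}

good_fit = residual(observed, [0.3, 0.5], [duplication, deletion])
poor_fit = residual(observed, [0.3, 0.4], [duplication, deletion])
```

Here the weights 0.3 and 0.5 would correspond to the fraction of the final metagenome that carries each event.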

Definition 39. Given a metagenome sequence $S = (m_1, \ldots, m_n)$, we can
construct a flow sequence $S_f$:

$$S_f = (\phi_1(f(g_1)), \ldots, \phi_n(f(g_n)))$$

S is called a realization of Sf. 

3.4.2 Simple flow sequences

Definition 40. A simple flow sequence is a flow sequence such that the
difference between any two successive flows is a simple flow.

Lemma 6. The flow sequence associated to an elementary metagenome se- 
quence is necessarily simple itself. 

The set of simple flow sequences is therefore representative of the entire 
set of elementary evolutionary metagenome sequences. 

3.4.3 Cost of flow sequences

Definition 41. The cost $C_f(S)$ of a flow sequence $S$ in $\mathcal{F}^+(G)$ is the minimal
cost across all of its realizations.

Definition 42. The apparent complexity of a flow sequence $S = (f_1, f_2, \ldots, f_n)$
is equal to:

$$C(S) = \sum_{i=1}^{n-1} C(f_i, f_{i+1})$$

Lemma 7. For any flow sequence $S$:

$$C(S) \le C_f(S)$$

Definition 43. For a given flow sequence $S$, its recombination cost is defined
as the non-negative number:

$$R_c(S) = C_f(S) - C(S)$$

This discrepancy between apparent complexity and cost is called
recombination cost because it appears when projecting a flow from a phased graph
onto G, thus confounding homologous sequences together. In those cases,
DCJ events involving homologous breakends have a cost lower than 1 (see
Supplement). The extreme case is a scenario where only (whole thread)
duplication and recombination events occur, i.e. DCJ events between pairs of
homologous breakends, in which case it is easily verified that the apparent
complexity is necessarily 0.

3.4.4 Rooted Metagenome Transitions

Definition 44. A rooted metagenome transition is a 1-step evolutionary
metagenome sequence of weighted metagenomes $S = ((w_1, m_1), (w_2, m_2))$,
where $m_1$ has a single clone $g$ that necessarily has weight 1. We refer to $g$ as
the root genome of $S$.

This representation fits the kind of data which can be obtained from 
whole genome sequencing. In particular, we can measure copy number along 
segments, and detect the creation of novel breakpoints, all of which can be 
represented as a metagenomic flow. By default, the root genome is assumed 
to be the (e.g. human) reference genome. 

3.5 Computational complexity analysis 

Although this problem bears similarities to many known NP-hard problems of
rearrangement theory [Pe'er and Shamir, 1998], the use of real-valued weightings
requires us to verify the complexity of computing the cost of an evolutionary
genome sequence of weighted metagenomes.

Theorem 6. It is NP-hard to compute $C_g(S)$ for a rooted metagenome
transition $S$.

3.6 Heuristic sampling 

In the absence of an efficient exact algorithm to compute the cost of a
metagenome transition flow change $\Delta f$, we must adopt a heuristic search of
the parameter space.

3.6.1 Recombination 

We start by restricting our parameter space to flow sequences. For simplicity, 
we ignore the recombination cost, in other words considering recombination 
events as special DCJ events with cost 0. This means that we can limit our 
computations to flow sequences, instead of creating the infrastructure needed 
for computing on phased sequence graphs. 

3.6.2 Homeoplasy

To reduce the parameter space, we also assume that homeoplasy is rare. In other
words, we assume that while a particular rearrangement can be repeated 
identically, e.g. multiple simultaneous tandem duplications, it is very rare 
for two independent sets of rearrangements on divergent lineages to create 
the same flow change. In the context of metagenomic flow sequences, homeo- 
plasy is defined as sets of flow changes on divergent lineages within the clone 
derivation graph that are not linearly independent as vectors. 

However, to allow for repeated identical and simultaneous events such 
as tandem repeats, identical and simultaneous (i.e. having the same set of 
derived clones in the final metagenome) flows are added together into a sum 
flow change. The weight of this sum flow change, which is equal to the sum 
of weights of the merged flows, is no longer bounded by 1. To deconvolve 
how many flow changes were collapsed into a sum flow change, we assume 
by parsimony that a sum flow change with weight $a$ actually represents $\lceil a \rceil$
identical and simultaneous flows. For example, if the flow along a thread is
increased by 1.5, it may have been duplicated in 75% of the leaf clones, or
triplicated in 50% of the leaf genomes. We opt for the first solution as it
involves fewer events.
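
The parsimony rule can be written out directly. The helper below is a small sketch (the name `deconvolve` is made up here): it returns the smallest number of identical flows that can add up to a given total weight when each individual weight is bounded by 1, together with the weight shared by each of them.

```python
import math

def deconvolve(total_weight):
    """Split a sum flow change into ceil(total_weight) identical flows of equal weight."""
    if total_weight <= 0:
        raise ValueError("the weight of a sum flow change must be positive")
    count = math.ceil(total_weight)
    return count, total_weight / count

events, weight_each = deconvolve(1.5)
```

For the example in the text, a total weight of 1.5 is read as two identical flows each present in 75% of the leaf clones, rather than three flows each present in 50%.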

As noted earlier, we do not allow for lateral gene transfer, therefore we 
sample the space of clone derivation trees with weighted leaves such that 
each flow change is placed on a branch such that the sum of weights of its 
derived leaves is equal to its weight. 

3.6.3 Collapsing equivalent solutions 

Let K be the set of edges with zero overall flow change. In our heuristic 
algorithms, we will assume that segment edges which are part of K are never 
traversed by any flow. This amounts to stating that we do not allow dupli- 
cation and then deletion of segments on the regions which do not appear to 
have had any CNV changes. Normally, if this occurred, then the deletion 
would leave behind a created bond, which would then require another oper- 
ation to be deleted so as to fit the given data. Allowing for CNV changes on 
segments of K would therefore only augment the parameter space without 
improving our understanding of the complexity. 

Let Q be the set of vertices which are only incident to edges of K. To prevent a
combinatorial explosion because of equivalent cycles going through Q, Q is
collapsed into a universal connector vertex, $\omega$. In other words, if an
alternating flow traversal has edges incident with vertices of Q, it is represented
as going to $\omega$, self-looping on that vertex, then continuing out. This ensures
that otherwise equivalent histories are not distinguished because of
irrelevant labeling differences. It is interesting to note that any alternating flow
traversal which self-loops from $\omega$ to itself more than once is automatically
simplified into a simple flow traversal with lesser complexity. This avoids
extrapolating useless operations on the edges of K.
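
The collapsing step can be sketched as follows, assuming a hypothetical representation where the graph is a dictionary from edge identifiers to vertex pairs and the overall flow change is a dictionary from edge identifiers to numbers. The label `"omega"` stands for the universal connector vertex; none of these names come from the authors' code.

```python
OMEGA = "omega"  # label chosen here for the universal connector vertex

def collapse(edges, flow_change):
    """Return (K, Q, relabelled edges).

    K: edges with zero overall flow change.
    Q: vertices incident only to edges of K.
    Every vertex of Q is relabelled as the connector vertex OMEGA.
    """
    K = {e for e in edges if flow_change.get(e, 0) == 0}
    all_vertices = {v for pair in edges.values() for v in pair}
    touched = {v for e, pair in edges.items() if e not in K for v in pair}
    Q = all_vertices - touched

    def relabel(vertex):
        return OMEGA if vertex in Q else vertex

    collapsed = {e: (relabel(a), relabel(b)) for e, (a, b) in edges.items()}
    return K, Q, collapsed

edges = {"s1": ("a", "b"), "b1": ("b", "c"), "s2": ("c", "d"),
         "b2": ("d", "e"), "s3": ("e", "f")}
K, Q, collapsed = collapse(edges, {"s1": 1, "b1": -1})
```

In this toy graph the edge `b2` becomes a self-loop on the connector vertex, which is how a traversal passing through Q is represented.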

3.6.4 Ergodic tour of the parameter space 

For practical purposes, we will extend our search to a less restricted set of 
flows: 

Definition 45. The traversal $t$ is tight if $e_i = e_j$ implies that $(i - j)$ is even.
See Figure 10 for an example.
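
Tightness is straightforward to test on a traversal given as a list of edge identifiers: every repeated edge must always be visited at positions of the same parity. The function below is a minimal sketch with 1-based positions, matching the indices used in the definition.

```python
def is_tight(traversal):
    """True if every edge is only ever visited at positions of the same parity."""
    parity = {}
    for position, edge in enumerate(traversal, start=1):
        if parity.setdefault(edge, position % 2) != position % 2:
            return False
    return True

# Positions 2 and 5 hold the same edge, and (2 - 5) is odd: not tight.
overlapping = ["a", "b", "c", "d", "b", "f"]
# Positions 2 and 4 hold the same edge, and (2 - 4) is even: tight.
compatible = ["a", "b", "c", "b"]
```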




Figure 10: On the left, a non-tight cycle traversal: 2 and 5 overlap, although 
(2-5) is odd. A simple operation transforms it into a tight cycle traversal 
with identical alternating flow. 



Definition 46. A near-simple cycle traversal is an even-length cycle traversal 
which is tight and synchronized. Its derived flow is a near-simple flow. A 
near-simple flow sequence is composed of near-simple flows. 

Lemma 8. Under the above heuristic assumptions, the set of near-simple 
flow sequences with linearly independent flows which connect an original 
genome to a given metagenome flow is finite. 

This entails that any scoring scheme can be used to construct a valid 
probability mass function on this set of flow histories. 

Definition 47. If there exist near-simple cycle traversals $t_1 = \{e_1, e_2, \ldots, e_{2n}\}$
and $t_2 = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_{2\nu}\}$ such that they start at the same vertex and overlap
over their last $p > 0$ edges, but $e_1 \neq \epsilon_1$, we produce a third flow $t_3$, such
that:

$t_3$ can always be broken up into a set of near-simple cycle traversals $T$.
This transformation of $\{t_1, t_2\}$ into $\{t_1\} \cup T$ is called a merge.

See Figure 11 for an example. 




Figure 11: Two overlapping flows transformed by a merge. 

Definition 48. Given two tight cycle traversals $\{e_i\}_{i=1..2n}$ and $\{\epsilon_i\}_{i=1..2\nu}$
which overlap over their first $l$ edges, such that $l$ is odd, we can construct three
new cycle traversals:

$t_1 = \{E, e_{l+1}, \ldots, e_{2n}\}$
$t_2 = \{E, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$
$t_3 = \{e_1, \ldots, e_l, E\}$

where $E$ is a bond edge which connects the destination vertex of $e_l$ to the
starting vertex of $e_1$.

Alternately, we can build the following three cycle traversals:

$t_1 = \{E_1, E_2, E_3, e_{l+1}, \ldots, e_{2n}\}$
$t_2 = \{E_1, E_2, E_3, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$
$t_3 = \{e_1, \ldots, e_l, E_3, E_2, E_1\}$

where $E_1$ is a bond edge which connects the destination vertex of $e_l$ to $\omega$,
$E_2$ is a bond edge which connects $\omega$ to $\omega$, and $E_3$ is a bond edge which
connects $\omega$ to the starting vertex of $e_1$.

If $l$ is even and greater than zero, then we construct the following cycle
traversals:

$t_1 = \{E_1, E_2, e_{l+1}, \ldots, e_{2n}\}$
$t_2 = \{E_1, E_2, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$
$t_3 = \{e_1, \ldots, e_l, E_2, E_1\}$

where $E_1$ is a bond edge which connects the destination vertex of $e_l$ to $\omega$,
and $E_2$ is a bond edge which connects $\omega$ to the starting vertex of $e_1$.
This transformation is referred to as a split.
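
The first (odd overlap) case of the split can be sketched as below. The representation is an assumption of this example: a traversal is a list of directed steps `(start, end, kind)`, and the new bond edge is written in whichever direction the walk crosses it. Only the closure and even length of the three resulting traversals are checked; the sketch does not verify tightness or that the bond is allowed in a given graph.

```python
def split_odd(t1, t2, overlap):
    """Split two traversals sharing their first `overlap` steps (overlap odd).

    Returns three closed walks: the two remainders, each preceded by the new
    bond edge, and the shared prefix closed by that same bond edge.
    """
    if overlap % 2 != 1 or t1[:overlap] != t2[:overlap]:
        raise ValueError("traversals must share an odd-length prefix")
    prefix_end = t1[overlap - 1][1]     # destination vertex of the last shared step
    prefix_start = t1[0][0]             # starting vertex of the first step
    closing = (prefix_end, prefix_start, "bond")   # bond walked from prefix end to start
    opening = (prefix_start, prefix_end, "bond")   # same bond walked the other way
    return ([opening] + t1[overlap:],
            [opening] + t2[overlap:],
            t1[:overlap] + [closing])

def is_closed(walk):
    """True if each step starts where the previous one ended, cyclically."""
    return all(walk[i][1] == walk[(i + 1) % len(walk)][0] for i in range(len(walk)))

first = [("a", "b", "segment"), ("b", "c", "bond"),
         ("c", "d", "segment"), ("d", "a", "bond")]
second = [("a", "b", "segment"), ("b", "e", "bond"),
          ("e", "f", "segment"), ("f", "a", "bond")]
pieces = split_odd(first, second, 1)
```

After the split, the shared segment step is covered only by the third traversal, so the first two no longer overlap on it.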

See Figure 12 for an example. 




Figure 12: Two overlapping flows transformed by a split. 

This transformation is especially useful to ensure validity in a flow his- 
tory, since two cycles which overlap can be replaced with three independent 
cycles, which only overlap over rearrangement regions. This remaining over- 
lap cannot cause invalidity, i.e. genome flows with negative values. For this 
reason, invalid histories are simply penalized in our search with a surcharge 
for each observed case of invalidity, under the assumption that it is always 
possible to correct errors using splits. 

Theorem 7. The process of applying random merges and splits is ergodic
over all the independent near-simple flow sets which span a given net flow
change.

3.7 Tests 

3.7.1 Simulations 

To verify the strategy described above, we generated random evolutionary 
histories from a single circular chromosome composed of 100 atomic blocks. 
The evolutionary tree had a fixed depth of 10 stages, each stage had a 10% 
probability of clonal branching. Between two nodes of this tree, a single 
SV (70% inversion, 10 % tandem duplication, 10% deletion, 10% no event) 
was applied at random on the genome. The leaves of the tree were assigned 
random weights from a uniform distribution (1 to 10^). The pipeline was run 
for 100,000 iterations. A p-value was estimated by permutation tests. In the 
end the estimated minimal rearrangement costs had a 0.86 correlation with 
the true answer (p < 10~^). See Figure 13 for a summary of the results. 

3.7.2 Experiments on real data 

We ran the pipeline on 11 experimental HTS datasets produced by the TCGA
consortium [McLendon et al., 2008] consisting of paired Glioblastoma whole
genome shotgun datasets: TCGA-06-0145, TCGA-06-0152, TCGA-06-0185,
TCGA-06-0188, TCGA-06-0208, TCGA-06-0214, TCGA-06-0648,
TCGA-06-0881, TCGA-06-1086, TCGA-14-1401, and TCGA-16-1460. We
produced initial breakend and CNV calls with the BamBam variant caller (in
prep.), then ran 10 independent processes sampling the space of histories
using a Metropolis Markov Chain Monte Carlo (MCMC) stochastic sampling
strategy over 2500 iterations. Figure 14 shows how these independent
processes consistently converge towards solutions of minimal complexity, despite
occasional jumps in the search.
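
For readers unfamiliar with Metropolis sampling, the loop below shows its general shape on a toy problem. Everything specific here is an assumption of this sketch and not the pipeline's implementation: the acceptance rule exp(-delta / temperature), the integer state, and the toy cost function. In the setting of the text, a state would be a flow history, a proposal would be a random merge or split, and the cost would be the rearrangement cost plus any penalties for invalidity.

```python
import math
import random

def metropolis(initial, propose, cost, iterations, temperature=1.0, rng=None):
    """Generic Metropolis search; returns the last visited state and the cost trace."""
    rng = rng or random.Random(0)
    current, current_cost = initial, cost(initial)
    trace = [current_cost]
    for _ in range(iterations):
        candidate = propose(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse state with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
        trace.append(current_cost)
    return current, trace

def toy_propose(state, rng):
    """Toy proposal: move one step left or right on the integers."""
    return state + rng.choice((-1, 1))

# Near-zero temperature: uphill moves are effectively never accepted.
final_state, trace = metropolis(20, toy_propose, abs, 2500, temperature=1e-9)
```

At higher temperatures the trace shows occasional upward jumps, which is the behaviour visible in the grey lines of Figure 14.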

3.8 Simple flows and cactus graphs 

To help understand better the properties of simple flows, we describe below 
some of their fundamental properties. 

Figure 13: Evaluating the sampling algorithm against simulated histories.

Figure 14: Sampling histories on experimental cancer datasets. In each
frame, a dataset is analyzed by 10 different processes (grey lines), each
realizing an MCMC search. The different plots show for each dataset the evolution
of the rearrangement cost of the last visited history, throughout the MCMC
process. The red line indicates the median complexity of the 10 processes for
a given iteration step, to provide an overall view of the convergence of these
stochastic processes.

Definition 49. A cactus graph is a graph in which any two simple cycles
intersect in at most one vertex, which may be called an intersection point
[Harary and Uhlenbeck, 1952]. A cutpoint in a graph is a vertex that if
removed splits the connected component in which it resides into two or more
connected components called subcomponents.

Cactus graphs are a natural generalization of trees that is used extensively 
in several fields [Paten et al., 2011]. In particular, it is easy to see that every 
intersection point in a cactus graph is a cutpoint, and hence the graph has 
the structure of a tree of cycles and connecting edges (see Figure 15a).

Definition 50. In a cutpoint split a cutpoint is not removed but rather
replaced by a set of vertices, one vertex for each subcomponent that would
have been created by its removal. An even-odd cactus graph is a cactus
graph in which each connected component has an even number of edges, and
each cutpoint splits its component into two subcomponents, each with an
odd number of edges (see Figure 15b for an example).

Figure 15: (a) A cactus graph, (b) an even-odd cactus graph.



Definition 51. For any integer flow $f$ on G, its unitized flow $G(f)$ is the
graph $(V_f, E_f)$ where $V_f$ is the set of the vertices of G that have an edge with
nonzero flow incident upon them and $E_f$ is such that for any pair of nodes
$(x, y)$ connected in G by an edge with nonzero flow $n$, there are $|n|$
edges between $x$ and $y$ in $G(f)$, each with flow $+1$ if $n$ is positive and each
with flow $-1$ if $n$ is negative.

4 Discussion
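
Unitizing an integer flow amounts to expanding each edge into as many unit edges as its absolute flow. A minimal sketch, assuming the flow is stored as a dictionary keyed by hypothetical `(x, y, kind)` edge descriptors:

```python
def unitize(flow):
    """Return (vertices, unit_edges) for an integer flow {(x, y, kind): n}.

    Each edge with nonzero flow n is expanded into |n| unit edges carrying
    +1 if n is positive and -1 if n is negative; zero-flow edges are dropped.
    """
    vertices, unit_edges = set(), []
    for (x, y, kind), n in flow.items():
        if n == 0:
            continue
        vertices.update((x, y))
        sign = 1 if n > 0 else -1
        unit_edges.extend([(x, y, kind, sign)] * abs(n))
    return vertices, unit_edges

flow = {("x", "y", "segment"): 2, ("y", "x", "bond"): -1, ("y", "z", "bond"): 0}
vertices, unit_edges = unitize(flow)
```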



There were similar attempts to adapt the DCJ model to insertions and
deletions [Yancopoulos and Friedberg, 2009, Bader, 2009, Braga et al., 2011], but
we chose to reformulate equivalent ideas with a modified representation. We
believe this model with 4 types of edges (segment and bonds being created
and deleted) is easier to understand. More importantly, it allowed us to expand
the model to real-valued quantities without extra notation. In addition, it
provided us with a simple but solid framework to explore the large
parameter space of evolutionary histories tying an original genome to an unphased
metagenome. Our model allows duplications and deletions of arbitrary
regions, not predefined components, but does not allow de novo insertions, to
avoid the "free lunch" problem pointed out by Yancopoulos and Friedberg
[2009].

From a mathematical standpoint, the set of flows of G is very rich. It
bears similarity to the cycle space described by MacLane [1937], with the
added constraint of even-length cycles, hence the need to re-define simple
flows in this framework. The balance condition guarantees that a positive
flow can be decomposed as a set of weighted threads, and that in a flow
change, every created bond is compensated by either a corresponding bond
deletion or a segment duplication. As illustrated in the figures of the
Supplement, these conservative properties allow us to find a valid continuous
transformation from one genome to another, akin to the concept of
homotopy in algebraic topology.

The formulas exposed here are fully consistent with previous findings.
Using our notation, if we compare two genomes in G (one ancestral, one
derived) such that each genome visits each segment at most once, Yancopoulos
and Friedberg [2009] demonstrated that the rearrangement cost of the
history is equal to $n - |C|$. If only the ancestral genome is bound to visit each
segment at most once, Bader [2009] demonstrated that the cost has a lower
bound which is close to, but slightly lower than $n - |C| + L$ (the positive
term $L$ as defined in this paper counts all odd length cycles, whereas Bader
only counted cycles of length 1).

We demonstrated here that any flow can be decomposed into simple flows
(analogous to the AB-paths described in [Braga et al., 2011], or to the cycle
decompositions in [Shao and Lin, 2012]), for which this bound is tight. This
property is largely due to the fact that components of created and ancestral
bonds associated with simple flow changes are planar, thus reducing the
complexity of the MAX-ECD problem from NP-Complete [Caprara, 1999] to
polynomial time using Euler's formula. In addition, in the most general case,
when there is no constraint on the ancestral and derived genomes, we proved
that $n - |C| + L - O$ is a lower bound on the minimum cost of a realization.

From a broader genomic point of view, the algorithm presented here infers
the concentrations of the mixed genomes and their structure. This is sim- 
ilar to the problem of reconstructing population structure [Song and Hein, 
2005, Minichiello and Durbin, 2006] but without knowing the actual extant 
genomes or their weights. The model therefore phases all somatic variants as 
parsimoniously as possible. Because of the NP-completeness of the problem, 
we pragmatically assumed by the "infinite sites" argument that two homol- 
ogous points in the genome are unlikely to both be affected by distinct DCJ 
operations and the cost of homologous recombination can be ignored [Ma 
et al., 2008]. This approach is appropriate for current whole cancer genome 
datasets, where the short reads generally do not allow us to phase novel 
variants and there is no evidence of homologous recombination or horizontal 
DNA transfers between cancer cells. Local phasing, typically produced by 
long reads, could be introduced by phasing both the bonds and the segments 
(instead of just the segments). 

However, if the metagenome were provided with long distance phasing, 
for example draft assemblies of individual cells, then the model will have to 
be broadened, to deal with possible phasing contradictions which arise in the 
data. It would then have to allow homeoplasy and explicit recombination, 
and extend the descent tree to a directed acyclic graph, also known as an 
Ancestral Recombination Graph [Jotun Hein, 2006]. 

In a completely resolved world, where the leaf genomes are fully deter- 
mined, and the only remaining task is to infer the evolutionary history which 
ties them together, then the problem becomes even more difficult, and re- 
quires a more explicit representation, the Ancestral Variation Graph (AVG) 
[Paten et al., 2013]. Although not explicitly demonstrated, the data
structures used here are very much compatible with the AVG theory.
Supplemental Figures 7 and 9 illustrate how a simple flow history can be expanded into
an AVG. A simple flow can be expanded into a duplication, a rearrangement
and a deletion epoch [Paten et al., 2013]. The branch flow changes in a flow
history play the same role as the modules in a DNA history graph. Because
metagenomic data provides only summary statistics on the extant species
and not assembled sequences, the parameter space is greater than with fully
resolved genomes, requiring extra assumptions (in particular linear
independence of flows) to devise efficient strategies.

5 Conclusion 

We presented here a principled approach to the problem of reconstructing a
cancer metagenome sampled with short reads, with only a reference genome
and copy-number information on the sequences. Unlike most conventional
genome rearrangement problems, the extant genomes and their
concentrations are unknown; only the common ancestor is given. This greatly
expands the search space, and requires a scalable solution. We verified that
the problem is NP-complete but can be explored with simple sampling
approaches.

We based our model on the DCJ model with duplications and deletions,
and showed how the set of allowed operations had interesting algebraic
properties which could be exploited for calculations. Metagenomes, genomes and
rearrangements can all be viewed as elements of the same vector space in
which they can be added and subtracted at will. This recourse to linear
algebra has multiple advantages. Firstly, it allows a clean transition from integer
copy-number to real-valued coverage depths. Secondly, it makes
computational implementation much more efficient, as highly optimized linear algebra
libraries are available. Finally, it provides a direct definition of homeoplasy,
with a simple transformation associated with it.

6 Availability 

The code used to run the above simulations and tests is freely available at
https://github.com/dzerbino/cn-avg.

1 Representing static genomes



1.1 Sequence Graphs 

Definition 1. A sequence graph $G(V, E)$ is an undirected multigraph where
a set $V$ of vertices are connected by a set $E$ of edges. The edges of $E$
are divided into segment and bond edges. There can be more than one
segment edge between a pair of nodes $(x, y)$ (hence the graph can be a
multigraph), but there must be at most one bond edge between each pair of
nodes, and at least one segment edge incident on any node. The two ends
of a segment edge are necessarily distinct. In addition, for two segment
edges $(x, y)$ and $(x', y')$, if $x = x'$ then $y = y'$. They are then said to be
homologous. There is no a priori constraint on the length or content of the
labels of homologous segments. Although undirected themselves, segment
edges have directed labels: if traversed in one direction they are equivalent
to a given DNA sequence, whereas traversed in the opposite direction they
yield the reverse complement of that sequence. A weighting function assigns
a real-valued weight to each edge, such that edges with weight 0 can be
ignored.

Definition 2. A cycle traversal is a closed walk through the graph, possibly 
with edge re-use, with an explicit starting point and directionality. A cycle 
is an equivalence class of cycle traversals which cover the same edges with 
the same multiplicities. 

Definition 3. A sequence traversal is a cycle traversal which alternates be- 
tween bond and segment edges, starting on a segment edge. Its thread is the 
associated cycle. A genome is a multiset of one or more threads (i.e. with 
threads possibly represented multiple times in the genome). 

1.2 Flows 

Definition 4. The edge space $\mathbb{R}^E$ denotes the set of real-valued weightings
on the set of edges $E$.

Definition 5. A flow on G is a weighting in $\mathbb{R}^E$ such that at each vertex
the total weight of segment edge incidences equals the total weight of bond
edge incidences. We call this the balance condition. $\mathcal{F}(G)$ denotes the set of
all flows on G.
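
The balance condition is easy to check numerically. The sketch below assumes a hypothetical representation (not taken from the paper's code) in which each edge identifier maps to `(u, v, kind)` and the weighting maps edge identifiers to real numbers; a self-looping bond contributes two incidences at its vertex.

```python
from collections import defaultdict

def is_balanced(edges, weights, tol=1e-9):
    """True if, at every vertex, segment incidences and bond incidences weigh the same."""
    excess = defaultdict(float)
    for edge_id, (u, v, kind) in edges.items():
        sign = 1.0 if kind == "segment" else -1.0
        w = weights.get(edge_id, 0.0)
        excess[u] += sign * w
        excess[v] += sign * w   # when u == v this adds the second incidence of a self-loop
    return all(abs(value) <= tol for value in excess.values())

# A circular chromosome made of one segment (x, y) closed by one bond (y, x).
edges = {"s": ("x", "y", "segment"), "b": ("y", "x", "bond")}
```

With these two edges, any weighting that gives the segment and the bond the same value is a flow, whether that value is an integer copy number or a real-valued coverage.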

Definition 6. $\mathcal{F}_{\mathbb{Z}}(G)$ denotes the set of integer-valued flows on G.

Definition 7. An integer flow $f \in \mathcal{F}_{\mathbb{Z}}(G)$ is minimal if there do not exist two
non-zero integer flows $(f_1, f_2)$ such that $f = f_1 + f_2$ and $\|f\|_1 = \|f_1\|_1 + \|f_2\|_1$.
We denote the set of minimal flows in G by $\mathcal{M}(G)$.

Definition 8. $\mathcal{F}^+(G)$ and $\mathcal{F}^+_{\mathbb{Z}}(G)$ denote the sets of nonnegative and
nonnegative integer flows on G, respectively.

1.3 Genome flows 

Definition 9. For a particular thread $t$, its thread flow $f(t)$ is the weighting
which assigns to each edge the number of times it is covered. Because a
thread $t$ is an alternating cycle of bonds and segments, this weighting $f(t)$
satisfies the balance condition, and hence is a flow. Let $\mathcal{T}(G)$ denote the set
of thread flows on G. The genomic flow $f(g)$ for a genome $g$ is the sum of
thread flows $f(t)$ for the threads in $g$. Let $\mathcal{F}_{\mathbb{N}}(G)$ denote the set of genome
flows for all genomes on G.

Lemma 1. $\mathcal{F}_{\mathbb{N}}(G) = \mathcal{F}^+_{\mathbb{Z}}(G)$.

Proof. By construction, thread and genome flows are contained in $\mathcal{F}^+_{\mathbb{Z}}(G)$.
To decompose a flow $f$ of $\mathcal{F}^+_{\mathbb{Z}}(G)$ into a sum of circular thread flows, we color
the edges with non-zero weight such that segment edges are blue, and bonds
red. We thus create a bi-colored Eulerian multigraph, meaning that at each
vertex the sum of weightings of blue incident edges is
equal to the sum of weightings of red incident edges. Finding a cover of
this graph composed of alternating cycles can be done in polynomial time
[Kotzig, 1968, Pevzner, 1995]. By construction the sum of the thread flows
associated to these cycles is $f$. □

It is interesting to note that Caratheodory's theorem [Caratheodory, 1911]
allows us to choose these genome flows such that q has an upper bound of
$|E| + 1$.

1.4 The conjugate balance condition and alternating 
flows 

Definition 10. For any weighting $f$ in $\mathbb{R}^E$, its conjugate weighting $\hat{f}$ assigns
$f(e)$ to any segment edge $e$, and $-f(e)$ to any bond edge $e$.

Definition 11. The condition that the sum of incident weights for every
vertex is 0 is called the conjugate balance condition.

Definition 12. For each edge $e$, let $\delta_e$ be the weighting that assigns 1 to $e$,
and 0 to all other edges.

Definition 13. Given the cycle traversal $t = \{e_i\}_{i=1..2n} \in E^{2n}$ of an even
length cycle $c$ (possibly with edge re-use, possibly with 2 or more
consecutive bond or segment edges), its associated alternating weighting is
$w(t) = \sum_{i=1}^{2n} (-1)^i \delta_{e_i}$.

Definition 14. An alternating flow is the conjugate of an alternating
weighting. Let $\mathcal{A}(G)$ denote the set of alternating flows of G.
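
The following sketch computes an alternating weighting and its conjugate for a traversal given as a list of edge identifiers. It assumes the convention used in the proofs of this supplement, namely that the i-th edge of the traversal (1-based) contributes (-1)^i, and that conjugation flips the sign on bond edges only. The edge names and the `kinds` lookup table are hypothetical.

```python
def alternating_weighting(traversal):
    """Sum of (-1)^i on the i-th edge of the traversal, with 1-based positions."""
    weighting = {}
    for position, edge in enumerate(traversal, start=1):
        weighting[edge] = weighting.get(edge, 0) + (-1) ** position
    return {edge: value for edge, value in weighting.items() if value != 0}

def conjugate(weighting, kinds):
    """Keep the value on segment edges and negate it on bond edges."""
    return {edge: (value if kinds[edge] == "segment" else -value)
            for edge, value in weighting.items()}

kinds = {"b1": "bond", "s1": "segment", "b2": "bond", "s2": "segment"}

# A thread traversed starting on a bond edge: its alternating flow is its thread flow.
from_bond = conjugate(alternating_weighting(["b1", "s1", "b2", "s2"]), kinds)
# Shifting the starting edge by one position negates the alternating flow.
from_segment = conjugate(alternating_weighting(["s1", "b2", "s2", "b1"]), kinds)
```

The two results illustrate why exactly two opposite alternating flows are obtained from the traversals of one even-length cycle.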

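Supplemental Lemma 1. $\mathcal{T}(G) \subset \mathcal{A}(G)$

Proof. If we traverse a thread by starting on a bond edge, the associated
alternating flow is:

$$\begin{aligned}
f &= \widehat{\sum_{i=1}^{2n} (-1)^i \delta_{e_i}} \\
  &= \widehat{\sum_{e_i \in \text{segments}} \delta_{e_i} - \sum_{e_i \in \text{bonds}} \delta_{e_i}} \\
  &= \sum_{e_i \in \text{segments}} \delta_{e_i} + \sum_{e_i \in \text{bonds}} \delta_{e_i} \\
  &= \sum_{i=1}^{2n} \delta_{e_i}
\end{aligned}$$

$f$ is equal to the thread flow of that path. □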

Supplemental Lemma 2. Any flow in $\mathcal{F}_{\mathbb{Z}}(G)$ can be expressed as a
linear combination of thread flows, and by extension a linear combination of
alternating flows.

Proof. G is a connected graph such that every vertex has incident segment
and bond edges. It is therefore possible to define a thread $T$ which tours all
its edges, allowing for edge reuse. Let $f^*$ be its thread flow. $f^*$ is an integer
flow of G which is strictly positive over all edges. Let $f$ be a flow of $\mathcal{F}_{\mathbb{Z}}(G)$,
and $m$ be its minimum value. If $m$ is non-negative, then $f$ is already in
$\mathcal{F}^+_{\mathbb{Z}}(G)$. Otherwise, $(f - m f^*)$ is necessarily in $\mathcal{F}^+_{\mathbb{Z}}(G)$. $f$ can therefore be
described as a linear combination of positive integer flows $(f - m f^*)$ and $f^*$,
which can themselves be decomposed into threads. □

Supplemental Lemma 3. Exactly two opposite alternating flows $f^+$ and
$f^- = -f^+$ are defined from the traversals of an even length cycle.

Proof. Because the function $\phi : i \mapsto (-1)^i$ has a periodicity of 2, shifting the
starting edge by an even number of positions produces an identical flow. In
addition, because the length of the cycle is even, for a given starting edge,
the resulting alternating flow is independent of the direction of traversal.
Finally, because $\phi$ alternates between opposite values, displacing the starting
vertex and edge by one position is equivalent to multiplying the associated
alternating flow by $-1$. □

Supplemental Lemma 4. Any flow in $\mathcal{F}_{\mathbb{Z}}(G)$ can be decomposed as a sum
of alternating flows.

Proof. $\mathcal{T}(G)$ is a spanning set of the modular lattice $\mathcal{F}_{\mathbb{Z}}(G)$ and $\mathcal{T}(G)$ is
a subset of $\mathcal{A}(G)$, therefore $\mathcal{A}(G)$ is a spanning set of the modular lattice
$\mathcal{F}_{\mathbb{Z}}(G)$. Because every alternating flow has an opposite alternating flow
defined from the same cycle, any flow in $\mathcal{F}_{\mathbb{Z}}(G)$ can be decomposed as a sum
of alternating flows. □

Supplemental Lemma 5. For $f \in \mathbb{R}^E$, $\|\hat{f}\|_1 = \|f\|_1$.

Proof. On every edge $e$, $|\hat{f}(e)| = |f(e)|$. □

Supplemental Lemma 6. Given an even length cycle traversal $t = \{e_i\}_{i=1..l}$
and its associated alternating flow $f$, $\|f\|_1 \le l$.

Proof. This inequality is a simple consequence of the definition of the
alternating flow of a cycle traversal:

$$\|f\|_1 = \Big\| \sum_{i=1}^{l} (-1)^i \delta_{e_i} \Big\|_1 \le \sum_{i=1}^{l} \big\| (-1)^i \delta_{e_i} \big\|_1 \le l$$

□

Supplemental Definition 1. The traversal $t$ is tight if $e_i = e_j$ implies that
$(i - j)$ is even.

See Figure 1 for an example. 




Figure 1: On the left, a non-tight cycle traversal: 2 and 5 overlap, although 
(2-5) is odd. A simple operation transforms it into a tight cycle traversal 
with identical alternating flow. 



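Supplemental Lemma 7. The above relationship becomes an equality iff $t$
is tight.

Proof. For an edge $e$ of G, let $N_e$ be the (possibly empty) set of indices $i$
for which $e_i = e$. Let us suppose that $t$ is tight. This means that
$\forall e \in E, \forall (i, j) \in N_e^2, (-1)^i = (-1)^j$. We then have:

$$\|f\|_1 = \Big\| \sum_{i=1}^{l} (-1)^i \delta_{e_i} \Big\|_1 = \sum_{e \in E} \Big| \sum_{i \in N_e} (-1)^i \Big| = \sum_{e \in E} |N_e| = l$$

Conversely, let us suppose that there exist $p$ and $q$ such that $e_p = e_q$ yet
$p - q$ is odd. This means that $(-1)^p = -(-1)^q$, we therefore have:

$$\begin{aligned}
\|f\|_1 &= \sum_{e \in E} \Big| \sum_{i \in N_e} (-1)^i \Big| \\
        &= \sum_{e \in E \setminus \{e_p\}} \Big| \sum_{i \in N_e} (-1)^i \Big| + \Big| \sum_{i \in N_{e_p} \setminus \{p, q\}} (-1)^i \Big| \\
        &\le \sum_{e \in E \setminus \{e_p\}} |N_e| + (|N_{e_p}| - 2) \\
        &\le l - 2 \\
        &< l
\end{aligned}$$

□
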



Supplemental Lemma 8. If $f$ is minimal then there exists a tight
even-length cycle $c$ such that $f$ is an alternating flow of $c$.

Proof. Let $f$ be a minimal flow. We construct a bi-colored multigraph $H_f$
composed of edges which have non-zero weighting under $f$. Edges with
positive weight under $f$ are colored in red, those with negative weight are colored
in blue. Because of the conjugate balance condition of $f$, we constructed a
bi-colored Eulerian graph, hence it is possible to find an alternating cycle
decomposition $\{c_i\}$ of this graph [Kotzig, 1968, Pevzner, 1995]. By
construction, $\sum_i |c_i| = \|f\|_1$. By choosing traversals which start on blue ($f$ negative)
edges, we construct a set of alternating flows $\{f_i\}$, such that $\sum_i f_i = f$.
Because each cycle $c_i$ alternates in colors, it is necessarily tight. The above
lemma tells us that $\sum_i \|f_i\|_1 = \sum_i |c_i| = \|f\|_1$. This cycle decomposition
cannot contain more than one cycle $c_1$, else $f$ would not be minimal. We
therefore have $\|f\|_1 = |c_1|$. □

See Figure 1 for an example of constructing a tight traversal of a minimal 
flow. 

Supplemental Lemma 9. If a flow $f$ is not minimal, i.e. it can be
decomposed into flows $f_1$ and $f_2$ such that
$\|f_1\|_1 + \|f_2\|_1 = \|f\|_1$, then a pair of tight traversals of $f_1$
and $f_2$ must cover only edges where $f$ is non-zero and have the same sign
as $f$.

Proof. This follows from the fact that $\|f_1\|_1 + \|f_2\|_1 = \|f\|_1$. □

Definition 15. Two distinct indices $i$ and $j$ are said to coincide if
$e_i$ and $e_j$ end on the same vertex of $G$. The traversal $t$ is
synchronized if whenever two distinct indices $i$ and $j$ coincide, then
$(i - j)$ is odd.




Figure 2: On the left, a non-synchronized cycle traversal: edges 3 and 7 
coincide, although (3-7) is even. A simple operation transforms it into two 
synchronized cycles, such that the sum of alternating flows is unchanged. 



Definition 16. A set of pairs of integers $P$ is nested if there does not
exist $(i_1, j_1), (i_2, j_2) \in P$ such that $i_1 < i_2 < j_1 < j_2$. The
traversal $t$ is nested if the set of pairs of coinciding indices of $t$ is
nested.




Figure 3: On the left, a non-nested cycle traversal: coinciding index pairs 
(2, 7) and (5, 8) are such that 2<5<7<8. A simple operation transforms 
it into two nested cycles such that the sum of alternating flows is unchanged. 
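
The two definitions above can be checked mechanically. The sketch below is only an illustration under simplifying assumptions: a traversal is represented by the list of vertices on which its successive edges end (a hypothetical encoding), indices start at 1, and the cyclic wrap-around of indices is ignored.

```python
from itertools import combinations


def coinciding_pairs(end_vertices):
    """Return 1-based index pairs (i, j), i < j, whose edges end on the same vertex."""
    indexed = list(enumerate(end_vertices, start=1))
    return [(i, j) for (i, u), (j, v) in combinations(indexed, 2) if u == v]


def is_synchronized(end_vertices):
    """Synchronized: every pair of coinciding indices has an odd difference."""
    return all((j - i) % 2 == 1 for i, j in coinciding_pairs(end_vertices))


def is_nested(pairs):
    """Nested: no two pairs interleave as i1 < i2 < j1 < j2."""
    return not any(
        i1 < i2 < j1 < j2 for i1, j1 in pairs for i2, j2 in pairs
    )


# Edges 3 and 7 end on the same vertex "x", and 3 - 7 is even (cf. Figure 2).
unsynchronized = ["a", "b", "x", "c", "d", "e", "x", "f"]
# Coinciding pairs (2, 7) and (5, 8) interleave (cf. Figure 3).
crossing = ["a", "x", "b", "c", "y", "d", "x", "y"]
```

The two toy traversals mirror the situations described in the captions of Figures 2 and 3: the first fails the synchronization test, the second is synchronized but not nested.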

Definition 17. The alternating flow of a synchronized and nested even
length cycle traversal is simple. We denote by $\mathcal{S}(G)$ the set of
all simple flows on $G$.

Supplemental Lemma 10. A simple flow is necessarily tight. 

Proof. We shall demonstrate this by contradiction. Let us assume that in
cycle traversal $t = \{e_1, \ldots, e_n\}$, $e_i = e_j$ and $(i - j)$ is
odd.

If $e_i$ and $e_j$ are traversed in the same direction and $|i - j| > 1$,
then $i$ and $j$ coincide, but also $(i - 1) \bmod n$ and $(j - 1) \bmod n$.
These four indices are distinct and not nested, hence $t$ would not be
nested.

If $e_i$ and $e_j$ are traversed in the same direction and $|i - j| = 1$,
then that edge is necessarily a circular bond edge, which connects a node to
itself. Orientation is therefore irrelevant, and $e_i$ and $e_j$ can be
considered as going in opposite orientations, which leads to the following
case.

If $e_i$ and $e_j$ are traversed in opposite directions, then
$((i - 1) \bmod n)$ and $j$ coincide, and $((i - 1) \bmod n) - j$ is even,
hence $t$ would not be synchronized. □

Theorem 1. $\mathcal{S}(G) = \mathcal{M}(G)$

Proof. Let us first demonstrate that $\mathcal{M}(G) \subset \mathcal{S}(G)$.
Let $f$ be a minimal flow, and $t = \{e_i\}_{i=1..l}$ one of its tight
traversals. We will demonstrate by contradiction that $t$ is synchronized
and nested.

We will first assume that $t$ is not synchronized, i.e. there exist distinct
indices $i$ and $j$ such that $i$ and $j$ coincide, yet $(i - j)$ is even.
Without loss of generality, we assume that $i < j$. Let
$t_1 = \{e_{i+1}, \ldots, e_j\}$ and
$t_2 = \{e_1, \ldots, e_i, e_{j+1}, \ldots, e_l\}$, we verify that
$w(t_1) + w(t_2) = f$, and $|t_1| + |t_2| = l$. $t$ is tight, and it is easy
to verify that this property is preserved in $t_1$ and $t_2$. Hence
$\|w(t_1)\|_1 + \|w(t_2)\|_1 = |t_1| + |t_2| = l$. Therefore
$\|f\|_1 = \|w(t_1)\|_1 + \|w(t_2)\|_1$, hence $f$ could not be minimal.

We then assume that $t$ is synchronized, but not nested, i.e. that there
exist indices $i < p < j < q$ such that $(i, j)$ and $(p, q)$ are pairs of
coinciding indices. Then the traversals
$t_1 = \{e_1, \ldots, e_i, e_j, e_{j-1}, \ldots, e_{p+1}, e_{q+1}, e_{q+2}, \ldots, e_l\}$
and $t_2 = \{e_{i+1}, e_{i+2}, \ldots, e_p, e_q, e_{q-1}, \ldots, e_{j+1}\}$
can similarly be used as a decomposition of $f$, verifying that $f$ could
not be minimal. Thus, $t$ is both synchronized and nested. $f$ is therefore
simple.

We now demonstrate that $\mathcal{S}(G) \subset \mathcal{M}(G)$. Let
$t = \{e_i\}_{i=1..l}$ be a synchronized and nested cycle traversal with
flow $f$. Because it is impossible to find three indices $i$, $j$ and $k$
such that $(i - j)$, $(j - k)$ and $(k - i)$ are all odd, $t$ can only visit
any vertex at most twice. As seen above, $t$ necessarily describes a cactus
graph which can be vertex-partitioned into simple cycles. In addition, at
each overlap vertex between two simple cycles, at most two components can be
connected. The simple cycles of the cactus graph form a tree. Because $t$ is
synchronized, if $f = f_1 + f_2$ is a decomposition of $f$ and $t_1$ is a
traversal of $f_1$, $t_1$ must necessarily follow the component it is on,
and change component whenever it reaches a junction node. Thus $t_1$
traverses all of the vertices at least once, and thus is equal to $t$. Thus
$f$ is minimal (see Figure 14 for an illustration). □

2 Representing genome evolution 

2.1 Phased sequence graphs 

Definition 18. Given a genome $g$ on sequence graph $G$, its associated
phased sequence graph $G_g$ is the sequence graph obtained by creating novel
derived vertex-disjoint segment edges for the individual copies of the
segments repeated in the thread multiset $g$, and deleting non-traversed
segment edges. The graph of bond edges is made complete, i.e. every node is
connected by a bond edge to every other, including itself. The predecessor
mapping $\Psi$ maps from the segment edges of $G_g$ to the segment edges of
$G$ from which they came. The mapping $\Psi$ is naturally extended so that
any genome $g'$ on $G_g$ is mapped to a genome $\Psi(g')$ on $G$, and any
flow $f'$ on $G_g$ is mapped to a flow $\Psi(f')$ on $G$. In particular,
there is a genome $g'$ on $G_g$ that visits every segment in $G_g$ exactly
once and maps to the genome $\Psi(g') = g$ on $G$. Such a genome $g'$ is
called a phased version of $g$.

2.2 Evolutionary genome sequences 

Definition 19. An evolutionary genome sequence is a sequence
$S = (g_1, \ldots, g_n)$ of genomes where $g_1$ is defined as a thread
multiset on $G$, $g_2$ is defined as a thread multiset on $G_{g_1}$, $g_3$
is defined as a thread multiset on $G_{g_2}$, etc. To avoid deeply nested
subscripts, we denote by $G_i$ the graph resulting from iterative phasing of
$G$ by $g_1, \ldots, g_i$. Each pair $(g_i, g_{i+1})$ is called a genome
transition. For $i > 1$, $\Psi_i$ is the mapping from $G_i$ to its
predecessor $G_{i-1}$ associated with the transition $(g_{i-1}, g_i)$,
$\Psi_1$ being the mapping of $G_1$ onto $G$. The flow change $\delta f_i$
of the transition $(g_i, g_{i+1})$ is the difference between the flows of
the genomes in the transition projected onto $G_i$:

$$\delta f_i = f(\Psi_{i+1}(g_{i+1})) - f(g_i)$$

The composition $\Psi_1 \circ \ldots \circ \Psi_i$ is denoted $\Psi^i$. The
overall flow change of $S$, $\Delta f$, is the sum of these flow changes
projected onto $G$:

$$\Delta f = \sum_{i=1}^{n-1} \Psi^i(\delta f_i)$$



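Phased sequence graphs: $G \xleftarrow{\Psi_1} G_1 \xleftarrow{\Psi_2} G_2 \xleftarrow{\Psi_3} G_3 \cdots \xleftarrow{\Psi_n} G_n$
Genomes and flow changes: $g_1 \xrightarrow{\delta f_1} g_2 \xrightarrow{\delta f_2} g_3 \cdots \xrightarrow{\delta f_{n-1}} g_n$
Overall flow change (on $G$): $\Delta f$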

Figure 4: An evolutionary genome sequence 
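
To make the projection of flow changes concrete, here is a minimal sketch of how an overall flow change could be accumulated. It assumes that each flow is stored as a dictionary from edge labels to integers and that each predecessor mapping is a dictionary from the edges of one phased graph to the edges of its predecessor; these encodings, the function names and the toy labels are illustrative assumptions, not part of the model's definition.

```python
from collections import defaultdict


def project(flow, mapping):
    """Push a flow through a predecessor mapping, summing weights of edges with the same image."""
    image = defaultdict(int)
    for edge, weight in flow.items():
        image[mapping[edge]] += weight
    return {edge: weight for edge, weight in image.items() if weight != 0}


def overall_flow_change(changes, mappings):
    """changes[k] lives on graph k+1; mappings[k] maps edges of graph k+1 to graph k (graph 0 is G)."""
    total = defaultdict(int)
    for k, delta in enumerate(changes):
        # Apply the mappings from the most derived graph back down to G.
        for mapping in reversed(mappings[: k + 1]):
            delta = project(delta, mapping)
        for edge, weight in delta.items():
            total[edge] += weight
    return {edge: weight for edge, weight in total.items() if weight != 0}


psi_1 = {"A1": "A", "A2": "A", "b1": "b"}  # first phased graph -> G
psi_2 = {"A1x": "A1", "b1x": "b1"}         # second phased graph -> first
delta_1 = {"A2": 1}                        # a copy of segment A is gained
delta_2 = {"A1x": -1, "b1x": 1}            # later, A is lost and bond b is gained
result = overall_flow_change([delta_1, delta_2], [psi_1, psi_2])
```

In this toy case the gain and the loss of segment A cancel once projected onto $G$, so only the bond change remains.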



2.3 Cost of a genome sequence 

Definition 20. Let $g$ be a genome, and $s$ and $s'$ be threads in $g$. Let
$t = \{v_1, \ldots, v_n\} \in V^n$ and
$t' = \{v'_1, \ldots, v'_{n'}\} \in V^{n'}$ be sequence traversals of
respectively $s$ and $s'$ described as sequences of traversed vertices.

• A thread removal of $s$ in $g$ consists of removing a copy of $s$ from $g$.

• A complete thread duplication of $s$ in $g$ consists of adding a copy of
$s$ to $g$.

• An $n$-fold tandem thread duplication of $s$ for integer $n \ge 2$
consists of replacing $s$ in $g$ with the thread associated with the
concatenation of $n$ instances of the thread $t$ for $s$, producing a thread
that is $n$ times as long and still circular.

• An inversion DCJ operation at index $2i$ of the thread $t$ for $s$ in $g$
corresponds to replacing $s$ in $g$ with the thread associated with the
sequence traversal $\{v_1, v_2, \ldots, v_{2i}, v_n, v_{n-1}, \ldots, v_{2i+1}\}$.

• A splitting DCJ operation at index $2i$ of the thread $t$ for $s$ in $g$
consists of replacing $s$ with the threads associated with the sequence
traversals $\{v_1, v_2, \ldots, v_{2i}\}$ and $\{v_{2i+1}, \ldots, v_n\}$.

• A merging DCJ operation of the threads $t$ for $s$ and $t'$ for $s'$ in
$g$ corresponds to replacing $s$ and $s'$ in $g$ with the thread associated
with the concatenation of $t$ and $t'$.

These operations are called elementary genomic operations. 
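
The operations listed above act on sequence traversals in a straightforward way. The following sketch illustrates them on Python lists of vertex labels; it assumes that a circular thread is represented by one of its linear traversals and uses hypothetical function names.

```python
def tandem_duplication(t, n):
    """n-fold tandem duplication: concatenate n instances of the traversal."""
    if n < 2:
        raise ValueError("a tandem duplication needs n >= 2")
    return t * n


def inversion(t, i):
    """Inversion DCJ at index 2i: keep v_1..v_2i, then read v_n down to v_2i+1."""
    return t[: 2 * i] + t[2 * i:][::-1]


def splitting(t, i):
    """Splitting DCJ at index 2i: cut the traversal into two traversals."""
    return t[: 2 * i], t[2 * i:]


def merging(t, t_prime):
    """Merging DCJ: concatenate two traversals into one."""
    return t + t_prime


thread = ["v1", "v2", "v3", "v4", "v5", "v6"]
inverted = inversion(thread, 1)
left, right = splitting(thread, 1)
```

Here `inverted` keeps `v1, v2` in place and reverses the remainder, while `merging(left, right)` restores the original traversal.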

Definition 21. An elementary evolutionary genome sequence is an evolutionary
genome sequence in which the pair of genomes in every transition are
identical or differ by an elementary genomic operation. Its cost is equal to
the number of DCJ operations it contains.

Definition 22. An evolutionary genome sequence $S$ is a contraction of an
evolutionary genome sequence $S'$ if $S$ is obtained from $S'$ by omitting
some genomes in the sequence $S'$. In this case we say that $S'$ is an
extension of $S$. An extension $S'$ is elementary if $S'$ is an elementary
evolutionary genome sequence. The cost of an evolutionary genome sequence
$S$, denoted $C_G(S)$, is the minimum cost of $S'$ over all elementary
extensions $S'$ of $S$.

2.4 Cost of a simple genome sequence 

Definition 23. An evolutionary genome sequence is simple if the flow change 
of every transition in it is simple. 

Theorem 2. An elementary evolutionary genome sequence is simple. 

Proof. If the SV is a DCJ operation, Bafna and Pevzner [1993] showed that 
the difference between the two genomes is a simple cycle of bond creations 
and deletions. This cycle corresponds to a simple flow. 

If the SV is the loss or gain of a circular DNA product, the flow change 
is equal to a thread flow being added or subtracted, i.e. an alternating flow. 
By construction, a thread in the ancestral genome never visits any edge of 
its own phased graph twice. Its thread flow is therefore a simple flow on the 
phased graph. □ 

Definition 24. A genome transition is a 1-step evolutionary genome sequence.

Definition 25. A valid flow change is a pair of nonnegative integer flows
$(f_1, f_2) \in \mathcal{F}^+_{\mathbb{Z}}(G)^2$ such that $f_1$ has value 1
for every segment edge.

Definition 26. Let $(f_1, f_2)$ be a valid flow change. A realization of
$(f_1, f_2)$ is any elementary evolutionary genome sequence
$S = (g_1, \ldots, g_n)$ such that $f(g_1) = f_1$ and
$\Delta f(S) = f_2 - f_1$. The flow cost of $(f_1, f_2)$, denoted
$C_F(f_1, f_2)$, is the minimum cost of any realization of $(f_1, f_2)$.

Lemma 2. Every valid flow change has a realization, hence $C_F(f_1, f_2)$ is
well-defined, and for any 1-step evolutionary genome sequence $(g_1, g_2)$:

$$C_F(f(g_1), \Psi_2(f(g_2))) \le C_G((g_1, g_2))$$

Proof. $f_1$ and $f_1 + f_2$ can always be decomposed as genome flows, thus
defining an ancestral and a derived genome. The inequality is simply due to
the fact that $C_F(S)$ is defined as the minimum of a set that includes
$C_G(S)$. □

Definition 27. Let $(f_1, f_2)$ be two genome flows on $G$. A bond edge in
$G$ is ancestral if it has positive flow in $f_1$. A bond edge is created if
it is not ancestral (i.e. has flow 0 in $f_1$) but has positive flow in
$f_2$. A bond edge is deleted if it is ancestral (has positive flow on
$f_1$) but has flow 0 on $f_2$. The complexity of a flow transition is
defined by

$$C(f_1, f_2) = n - |C| + L - O$$

where:

$n$ is the number of ancestral bond edges,

$C$ is the set of connected components in $G$ formed by ancestral and
created bond edges,

$L$ is equal to $\sum_{c \in C} \lceil l_c / 2 \rceil$, where $l_c$ is equal
to the maximum number of edge-distinct non-alternating cycles of created and
destroyed bonds in $c$,

$O$ is equal to $\sum_{v \in V} \max(a_v - 1, 0)$, where $V$ is the set of
vertices of $G$ and $a_v$ is the number of ancestral bond edges incident on
vertex $v$.

Lemma 3. For any 1-step evolutionary genome sequence $S = (g_1, g_2)$:

$$C(f(g_1), \Psi_2(f(g_2))) \le C_F(f(g_1), \Psi_2(f(g_2))) \le C_G(S)$$

Proof. To prove this bound we will demonstrate that appending a DCJ event
to the end of an evolutionary sequence increases the complexity of the
overall flow change by at most 1, and appending a duplication or deletion
event does not affect the complexity. Thus we will have demonstrated that
any evolutionary sequence has an overall flow change whose complexity is a
lower bound on the number of DCJ events in the sequence. Conversely, given
an overall flow change, all of its realizations have a rearrangement cost
greater than or equal to its complexity.

A whole thread duplication event, whether tandem or not, or a whole 
thread deletion event, cannot create new bonds, therefore they do not affect 
C or the number of odd length loops in components of C. Because the overall 
flow change is phased, the term O is necessarily constantly equal to 0. The 
overall flow change complexity is therefore not affected. 

Appending a DCJ event to an evolutionary sequence cannot affect the
number of ancestral bonds, or the term $O$. We will therefore discuss the
effect it has on $|C|$ and $L$. Because every vertex in a phased sequence
graph has exactly one ancestral bond, a component can be split up, but never
removed entirely. As demonstrated by Bader [2009], $|C|$ decreases by at
most 1. If $L$ does increase, then it cannot increase by more than 2, since
at most one new cycle can be created by the addition of a created bond, and
a DCJ creates at most two bonds. If $L$ increases by 1 then all the vertices
involved in the DCJ were previously on the same component of $C$: they were
previously attached in pairs, and the two vertices involved in closing a
loop were previously unattached but on the same component. Consequently, the
operation cannot merge components: $|C|$ can only increase. Finally, if $L$
increases by 2, then the two created loops are necessarily on two different
components. Since (as in the previous case) the four vertices involved in
the DCJ were previously on the same component, this means that a component
was split into two, hence $|C|$ increases by 1. In all cases, a DCJ event
increases the complexity by at most 1. □

Supplemental Lemma 11. Given a simple flow change on a phased graph,
the term $l_c$ can be computed in polynomial time.

Proof. In all generality, computing the maximum number of edge-distinct
cycles within a graph is an NP-hard problem [Caprara, 1999]. However, since
the flow is minimal, no component of created and destroyed bonds can contain
an alternating cycle. Since the graph is a cactus graph (see 6 for a
demonstration), none of the bounded faces share an edge. Computing the
maximum number of non-alternating cycles for an element $c \in C$ is
therefore equivalent to computing the number of bounded faces in $c$. That
can be done in linear time with Euler's formula, which computes the number
of bounded faces of a connected planar graph as the number of edges, minus
the number of nodes, plus 1. □
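
The counting step of this proof can be sketched as follows. The code assumes the graph of created and destroyed bonds is given as a list of vertex pairs (a hypothetical encoding) and applies Euler's formula to each connected component, which equals the number of cycles only when the component is a cactus, as argued above.

```python
from collections import Counter


def bounded_faces_by_component(edges):
    """Return, for each connected component, edges - vertices + 1 (sorted)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    n_edges, n_vertices = Counter(), Counter()
    for u, _ in edges:
        n_edges[find(u)] += 1
    for x in list(parent):
        n_vertices[find(x)] += 1
    return sorted(n_edges[root] - n_vertices[root] + 1 for root in n_edges)


# Two triangles sharing vertex "c" (a cactus), plus a separate single edge.
cactus = [("a", "b"), ("b", "c"), ("c", "a"),
          ("c", "d"), ("d", "e"), ("e", "c"),
          ("x", "y")]
faces = bounded_faces_by_component(cactus)
```

For this toy input the cactus component contributes two cycles and the isolated edge none; a self-looping bond counts as one cycle.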

Theorem 3. For any simple 1-step evolutionary genome sequence
$S = (g_1, g_2)$, $C(f(g_1), \Psi_2(f(g_2)))$ can be computed in polynomial
time and:

$$C(f(g_1), \Psi_2(f(g_2))) = C_F(f(g_1), \Psi_2(f(g_2))) = C_G(S).$$

By extension, for any simple evolutionary genome sequence
$S = (g_1, \ldots, g_n)$:

$$\sum_{i=1}^{n-1} C(f(g_i), \Psi_{i+1}(f(g_{i+1}))) = C_G(S).$$

Proof. For an arbitrary valid flow change $(f_1, f_2)$, we already proved
that its complexity is a lower bound on the minimum rearrangement cost of
its realizations. If we construct a realization whose cost is equal to this
lower bound, its cost is necessarily minimal, and the minimum cost is
therefore equal to the lower bound. We also demonstrated that its complexity
could be computed in polynomial time.

We propose to construct a sequence of genome pairs
$(p_1 = (g_s, g_t), \ldots, p_i = (g_s^i, g_t^i), \ldots)$, and track the
evolution of $\{C(f(g_s^i), f(g_t^i))\}_{i=1..p}$, as we iteratively modify
either the ancestral or the derived genome of the pair. See Figures 5 and 7
for illustrated examples. As we will see, the two genomes will progressively
become more similar: we are effectively constructing the evolutionary
sequence from both ends (see Figures 6 and 8). The reason for this bilateral
construction is to keep the complexity calculations as simple as possible.
The term $O$ is equal to 0 throughout this proof.

We first simplify non-alternating cycles of created and deleted bonds.
Since every vertex has exactly one incident ancestral bond, such a cycle
contains at least one closing vertex which has two bond creation incidences
(a self-looping bond would have two incidences on the same node). The fact
that the flow change is simple imposes that the final flow $f(g_t^i)$ is
bounded by 3 on all edges, therefore the sum of weights of these bonds is
also bounded by 3. One of these bond creation incidences therefore has a
flow of 1 in $f(g_t^i)$, i.e. it is unique in $g_t^i$. If a component has
more than two odd length cycles, we pick a closing vertex from two of those
cycles, and a unique created bond incident on each of these nodes. With a
single DCJ operation on $g_t^{i+1}$, these two bonds are broken and the
corresponding vertices are reconnected arbitrarily among themselves (without
reforming the two bonds that were broken). Constructing a simple traversal
of the modified flow change is trivial provided a simple traversal of the
previous flow change, therefore the resulting overall flow change is still
simple. During this operation, $\delta n = 0$, $\delta |C| = 0$,
$\delta L = -1$, $\delta C = -1$. We repeat this operation until no
component of $C$ has more than one remaining cycle.

We then realize the duplications on $g_s^i$. A duplication region is a
connected component of bonds and segments on $G_{g_s}$ with positive
original flow and positive overall flow change, which contains at least one
segment. To analyze these components, we describe a path $W$ through $g_t^i$
which starts from an arbitrary duplicated segment edge $s$, extends on both
ends along its thread alternating between duplicated segments and bonds, and
terminates as soon as it encounters a created or non-duplicated bond or a
non-duplicated segment. All the traversed segments are necessarily derived
from the same thread $T$ in $g_s^i$. We discuss with respect to how this
iteration terminates:

A) If this path forms a cycle, it corresponds either to a duplicated thread
in $g_t^i$ copied from a single thread $T$ in $g_s^i$, or a single circular
thread in $g_t^i$ produced by the tandem duplication of $T$. We therefore
duplicate (possibly by tandem circular duplication) $T$ in $g_s^{i+1}$. The
segments derived from $T$ in $g_t^{i+1}$ are assigned ancestral segments on
the descendant(s) of $T$ in $g_s^{i+1}$ such that if two segments in
$g_t^{i+1}$ are connected by a bond, their ancestors in $g_s^{i+1}$ are
connected by a bond.

B) If the path terminates on one end on a bond $b$ with overall flow change
0 and on the other with a created bond (see Figure 7), then $b$ corresponds
to a unique anchor bond $b'$ in $g_s^i$. We therefore realize a tandem
duplication of $T$ in $g_s^{i+1}$. The duplicated thread has two copies of
$b'$, partitioning the thread into two homologous sub-threads, one primary,
the other secondary. The segments of $T \setminus W$ on $g_t^{i+1}$ are
marked as derived from the primary sub-thread in $g_s^{i+1}$, whereas the
segments of $W$ are marked as derived from the secondary sub-thread in
$g_s^{i+1}$.

C) If the path terminates on both sides with created edges (see Figure 5),
$T$ is duplicated entirely in $g_s^{i+1}$, creating two copies, one primary,
the other secondary. The segments of $T \setminus W$ on $g_t^{i+1}$ are
marked as descendants of the primary copy, and those of $W$ in $g_t^{i+1}$
are marked as descendants of the secondary copy.

D) If the path terminates on both sides with bonds $b_1$ and $b_2$ with
overall flow change 0, then we repeat the process, starting from a sibling
copy of $s$, creating a new path $W'$. $W'$ cannot traverse sibling bonds of
$b_1$ or $b_2$, therefore it will necessarily terminate in case C).

We repeat until there is no duplication region left. At the end of this 
process, every segment has negative or zero flow change, meaning that it has 
at most one incident ancestral bond, and one incident created bond. There 
are necessarily no internal cycles left in the components of C. 

We then verify that after all these duplications, $\delta C = 0$. Each
thread which is duplicated increases the number of ancestral bonds $n$. Of
those bonds many are in singleton components: within the duplicated region
$W$ all the derived bonds are identical to the ancestral bond. Conversely,
between the segments of the secondary copies which are not ancestral to $W$
and therefore have no descendants, there is simply a deleted bond. The
creation of each of these ancestral bonds therefore increases $|C|$ by 1.
Regarding the ancestral bonds which are flanking $W$, a discussion is
necessary. We note that after these duplications, any ancestral bond is
either destroyed or on a component of its own, since there are no duplicated
segments left. For a given ancestral bond flanking one of the duplicated
regions, either it is on a separate component from the bond it was copied
from, in which case $\delta |C| = 1$, or it is still connected, in which
case they are necessarily connected by a path of created and destroyed
bonds, which corresponds to a loop of $L$ which was opened up, hence
$\delta L = -1$.

We then undo the deletions in $g_t^i$. A deletion region is a connected
component of bond and segment deletions, which contains at least one segment
edge. Threads which are entirely removed by a circular deletion region are
simply added to $g_t^{i+1}$. The other deletion regions are necessarily
linear, as they must be included within the ancestral genomes. For each
vertex at the end of a deletion region, we follow the corresponding
component of created and deleted edges until reaching the next vertex which
flanks a deletion. This process iteratively defines a path through the
graph, and necessarily loops back to the initial node, thus defining a
finite thread (due to the finite length of the traversal of a simple flow)
which alternates between deletion regions and bonds between their
extremities. This thread is added to $g_t^i$, such that created bonds do not
create or join components: $\delta n = 0$, $\delta |C| = 0$,
$\delta L = 0$, $\delta C = 0$.

We finally realize the rearrangements. The flow difference between $g_s^i$
and $g_t^i$ is necessarily a simple flow of bond duplications and deletions.
Because two bond deletions cannot be incident on the same node, no vertex
can be visited twice. As shown by [Yancopoulos et al., 2005], it is possible
to reduce components containing a deleted edge, such that at each DCJ
operation, $\delta n - \delta |C| = -1$, $\delta L = 0$, $\delta C = -1$.

We therefore constructed an evolutionary sequence starting at genome $g_t$
and finishing at $g_s$ using exactly $C(g_s, w(g_t) - w(g_s))$ DCJ
operations. □

3 Representing metagenome evolution 
3.1 Metagenomes 

Definition 28. A metagenome is a multiset of genomes on $G$,
$m = \{g_1, \ldots, g_k\}$. The individual genomes in $m$ are called clones.
A weighted metagenome $(m, w)$ is a metagenome in which each clone $g_i$ has
a positive weight $w_i = w(g_i)$ such that $\sum_{i=1}^{k} w_i = 1$. The
metagenome flow associated with $(m, w)$ is the convex combination of the
flows of its clones:

$$f(m, w) = \sum_{i=1}^{k} w(g_i) f(g_i).$$

We denote the set of all metagenome flows for weighted metagenomes on $G$
by $\hat{\mathcal{F}}(G)$.
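
As a small worked example of this convex combination, the sketch below computes a metagenome flow from weighted clone flows. It assumes each clone flow is a dictionary from edge labels to copy numbers and uses exact fractions for the weights; the function name and the toy edge labels are hypothetical.

```python
from fractions import Fraction


def metagenome_flow(weighted_clones):
    """weighted_clones: list of (weight, clone_flow) pairs with positive weights summing to 1."""
    weights = [weight for weight, _ in weighted_clones]
    if any(weight <= 0 for weight in weights) or sum(weights) != 1:
        raise ValueError("weights must be positive and sum to 1")
    flow = {}
    for weight, clone_flow in weighted_clones:
        for edge, value in clone_flow.items():
            flow[edge] = flow.get(edge, 0) + weight * value
    return flow


clones = [
    (Fraction(3, 4), {"A": 1, "B": 1}),  # majority clone with one copy of A and B
    (Fraction(1, 4), {"A": 2}),          # minority clone with two copies of A, B lost
]
mixed = metagenome_flow(clones)
```

The resulting flow is fractional (5/4 on A and 3/4 on B), which is what distinguishes metagenome flows from the integer flows of single genomes.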

Definition 29. For any non-negative flow $f \in \mathcal{F}^+(G)$, we say
that the metagenome $(m, w)$ realizes $f$ if $f(m, w) = f$.

Lemma 4. $\hat{\mathcal{F}}(G) = \mathcal{F}^+(G)$, i.e. every non-negative
flow can be realized.

Proof. Proving the inclusion $\hat{\mathcal{F}}(G) \subset \mathcal{F}^+(G)$
is trivial because, by construction, thread, genome and metagenome flows are
all positive flows.

We shall prove the equality $\hat{\mathcal{F}}(G) = \mathcal{F}^+(G)$ with
the iterative construction of a metagenomic flow equal to a given positive
flow $f$. The initial target flow is $f_0 = f$.

At each iteration $n$, we start from an arbitrary starting segment edge with
non-zero flow, and follow a thread path $t_n$ that alternates between
segment and bond edges with strictly positive target flow. Because the
target flow is balanced, it is always possible to close such a path in
finite time. For each traversed edge, we divide the target flow by the
number of times the edge was traversed. The thread flow is assigned a
coefficient $a_n$ equal to the minimum of these computed quotients. The
resulting target flow, $f_{n+1} = f_n - a_n f_{t_n}$, is necessarily
non-negative. If it is uniformly null the process terminates, else it
re-iterates. Because at each iteration at least one edge has its target flow
set to zero, the process terminates in a finite time. When the process
terminates after iteration $q$, we have produced a linear combination of
thread flows such that:

$$\sum_{n=0..q} a_n f_{t_n} = \sum_{n=0..q} (f_n - f_{n+1}) = f_0 - f_{q+1} = f$$

□

Figure 5: Finding a sequence of transformations which explain a simple flow
difference. At each step, we modify the starting (top) or the end genome
(bottom). We represent next to them the phased flow difference. For
simplicity, bond edges with zero ancestral or derived flow are not
represented. To the right, we compute the complexity of the phased flow
difference, decomposed as $n - |C| + L$. The first stage represents the
original situation. Highlighted is a component of created and ancestral
edges with self-looping edges which need to be simplified with a DCJ
operation. At the next stage, a duplication region is highlighted. Starting
from a duplicated segment $s$, we extend a path $W$ through $g_t$, which
allows us to define a specific duplication (the secondary copy is marked
with an asterisk). On the third stage, we highlighted the deletion regions
which need to be added to $g_t$, along with the circular path (orange lines)
which connects them. After this last transformation, the two genomes are
separated by a simple sequence of DCJ operations.

Figure 6: Constructing an evolutionary sequence from the pairs of genomes
constructed in Figure 5. The exact evolutionary sequence between the two
genomes of the final pair, decomposed into single DCJ events, is not
described here, as it can be easily inferred, as shown by Yancopoulos et al.
[2005].

Figure 7: Alternate realization of the simple overall flow change described
in the previous example. On the second step, another copy of C is chosen to
initiate the walk along $g_t$, which terminates on bond $b$ with overall
flow change 0. $b$ corresponds to a unique bond $b'$ in $g_s$. We therefore
apply a tandem duplication of the corresponding thread, using the images of
$b'$ as boundaries between the primary and secondary copies.

Figure 8: Constructing an evolutionary sequence from the pairs of genomes
constructed in Figure 7.
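
The iterative construction used in this proof can be sketched in code. The version below is a simplified illustration under several assumptions that are not taken from the text: the graph is encoded by a `partner` dictionary (the other end of each vertex's segment) and a `bonds_at` dictionary (vertices reachable by a bond), flows are keyed by hypothetical `seg_key` and `bond_key` helpers, the input flow is assumed balanced, and each closed walk is extracted at the first repeated segment entry rather than by an arbitrary choice of path.

```python
from collections import Counter
from fractions import Fraction


def seg_key(u, v):
    return ("seg", frozenset((u, v)))


def bond_key(u, v):
    return ("bond", frozenset((u, v)))


def find_thread(partner, bonds_at, flow, start):
    """Alternate segment and bond edges with positive flow until a segment entry repeats."""
    entry_index = {}  # vertex from which a segment is entered -> position in path
    path = []
    v = start
    while v not in entry_index:
        entry_index[v] = len(path)
        far = partner[v]
        # Balance guarantees that some bond at the far end has positive flow.
        nxt = next(w for w in bonds_at[far] if flow[bond_key(far, w)] > 0)
        path.append((seg_key(v, far), bond_key(far, nxt)))
        v = nxt
    closed = path[entry_index[v]:]
    return [edge for step in closed for edge in step]


def decompose(partner, bonds_at, flow):
    """Return a list of (coefficient, traversal counts) whose weighted sum is the input flow."""
    flow = {edge: Fraction(value) for edge, value in flow.items()}
    threads = []
    while True:
        start = next((v for v in partner if flow[seg_key(v, partner[v])] > 0), None)
        if start is None:
            return threads
        counts = Counter(find_thread(partner, bonds_at, flow, start))
        coefficient = min(flow[edge] / c for edge, c in counts.items())
        for edge, c in counts.items():
            flow[edge] -= coefficient * c  # at least one edge drops to zero
        threads.append((coefficient, counts))


partner = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1"}
bonds_at = {"a1": ["a2", "b2"], "a2": ["a1", "b1"], "b1": ["a2"], "b2": ["a1"]}
target = {
    seg_key("a1", "a2"): Fraction(3, 2), seg_key("b1", "b2"): Fraction(1),
    bond_key("a2", "a1"): Fraction(1, 2), bond_key("a2", "b1"): Fraction(1),
    bond_key("b2", "a1"): Fraction(1),
}
threads = decompose(partner, bonds_at, target)
```

Exact fractions are used so that the edge realizing the minimum quotient reaches zero exactly, which is what guarantees termination in the proof. The sketch returns weighted thread flows only; grouping threads into clones with weights summing to 1 is not addressed here.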

3.2 Evolutionary metagenome sequences 

Definition 30. Given a metagenome $m$ on $G$, its phased sequence graph
$G_m$ is the union of the phased sequence graphs of its individual genomes.
A predecessor mapping $\Psi$ is built from the predecessor mapping of each
of these individual phased sequence graphs.

Definition 31. An evolutionary metagenome sequence is a sequence of
metagenomes $(m_1, \ldots, m_n)$ such that $m_1$ is defined on $G$, $m_2$ is
defined on $G_{m_1}$, etc. Analogously to evolutionary genome sequences, we
define a sequence of phased sequence graphs $(G_1, \ldots, G_n)$, a sequence
of predecessor mappings $(\Psi_1, \ldots, \Psi_n)$, metagenome transitions,
transition flow changes $(\delta f_1, \ldots, \delta f_n)$ and the overall
flow change $\Delta f$ (see Figure 9).

Definition 32. In a metagenome transition, because the phased sequence
graph of a metagenome is clearly divided into components, each clone of the
derived metagenome is unambiguously assigned a set of ancestral clones in
the ancestral metagenome, hence defining an acyclic clonal derivation graph.
Each branch of this graph describes a 1-step genome history between an
ancestral clone and a portion of a derived clone.

3.3 Cost of a metagenome sequence 

Definition 33. We define the following elementary metagenomic operations
on a metagenome $m$:

1. clone loss is the removal from $m$ of a clone.

2. subclone creation is the addition to $m$ of a new clone $g$ identical to
a pre-existing clone $g'$.

3. clone mutation is the alteration of a genome of $m$ by an elementary
genomic operation.

Figure 9: Example of an evolutionary genome sequence of weighted
metagenomes and its clonal derivation graph. Note that although the weights
within a metagenome add to 1, the weighting of a clone is independent from
the weighting of its ancestral clones.

Definition 34. An elementary evolutionary metagenome sequence is an evo- 
lutionary sequence in which two consecutive metagenomes are identical or 
differ by an elementary metagenomic operation. Its cost is the number of 
DCJ operations it contains. 

Definition 35. Given an evolutionary metagenome sequence
$S = (m_1, \ldots, m_n)$, and a subsequence $(1, i_1, \ldots, i_k, n)$ of
$(1, \ldots, n)$, we create a contraction $S'$ of $S$, consisting of the
metagenome sequence $(m_1, m_{i_1}, \ldots, m_{i_k}, m_n)$ and the
predecessor mapping sequence
$(\Psi_1, (\Psi_2 \circ \ldots \circ \Psi_{i_1}), \ldots, (\Psi_{i_k+1} \circ \ldots \circ \Psi_n))$.
$S$ is then said to be a realization of $S'$.

Definition 36. The cost $C_G(S)$ of an evolutionary metagenome sequence is
the minimum cost of any of its realizations.

3.4 Cost of simple metagenome sequences



Definition 37. An evolutionary metagenome sequence is simple if the flow 
change of every 1-step genome transition along every branch of the clone 
derivation graph is simple or null. 

Theorem 4. An elementary evolutionary metagenome sequence is simple. 

Proof. The flow change between two metagenomes in an elementary metagenome 
sequence is either null or equal to the flow change between two genomes in 
an elementary genome sequence, hence a simple flow. □ 

Theorem 5. For any simple evolutionary metagenome sequence $S$, $C_G(S)$
is equal to the sum of the costs of the 1-step simple genome histories at
each branch of the clone derivation graph.

Proof. The realization of a given metagenome transition within a metagenome
sequence does not affect the cost of the other transitions, hence they can
be constructed independently. In addition, the realization of a clonal
branch does not affect the cost of the other branches in the same
transition, therefore they can be constructed independently. □

4 Fitting the model to experimental data 

4.1 Flow sequences 

Definition 38. A flow sequence on $G$ is a sequence of metagenomic flows of
$G$ such that if one of these flows assigns 0 to a given segment edge, then
all subsequent flows assign 0 to that edge.

Definition 39. Given a metagenome sequence $S = (m_1, \ldots, m_n)$, we can
construct a flow sequence $S_f$:

$$S_f = (\Psi^1(f(m_1)), \ldots, \Psi^n(f(m_n)))$$

$S$ is called a realization of $S_f$.

4.2 Simple flow sequences

Definition 40. A simple flow sequence is a flow sequence such that the 
difference between any two successive flows is a simple flow. 

Lemma 5. The flow sequence associated to an elementary metagenome se- 
quence is necessarily simple itself. 

Proof. If the flow difference between two consecutive genomes is a DCJ flow,
then its projection is necessarily a simple flow, as illustrated in Figure
10.

If the flow difference is a thread duplication or removal, then its
projection on $G$ is a thread flow multiplied by 1 or -1. It is trivial to
decompose a thread flow into a sum of simple thread flows, using the
transformations illustrated in Figures 3, 2 and 1. □

4.3 Cost of flow sequences 

Definition 41. The cost $C_F(S)$ of a flow sequence $S$ in
$\mathcal{F}^+(G)$ is the minimal cost across all of its realizations.

Definition 42. The apparent complexity of a flow sequence
$S = (f_1, f_2, \ldots, f_n)$ is equal to:

$$C(S) = \sum_{i=1}^{n-1} C(f_i, f_{i+1})$$

Supplemental Lemma 12. Within an elementary evolutionary metagenome 
sequence, at a transition (mi,mi+i); 

C(^,(/(m,)),^',+i(/(m,+i)) < C(/K),*i+i(/(m,+i)) 

Proof. We project (/(mj), ^'^(/(mj+i))) onto an arbitrary sequence of se- 
quence graphs starting at Gi and finishing at G such that at each step two 
vertices are merged. 

At most two components of active bonds are merged into one. If
$\delta|C| = -1$, two distinct components are merged. Since they were
disconnected, $n$ and $L$ are unchanged, and $\delta O$ is equal to 1, thus
$\delta C = 0$. If $\delta|C| = 0$, then at most two non-alternating cycles are
created, since at most two such cycles can go through a single node in a
simple flow. This means that $\delta L \le 1$. Either $\delta n$ is equal to -1
(both vertices had an ancestral bond to a common node) or $\delta O$ is equal
to 1 (otherwise). Thus the complexity does not increase.

By iteratively merging vertices, we reach a sequence graph isomorphic to
$G$. We have thus proven that the complexity of a flow on $G_i$ is necessarily
greater than or equal to the complexity of its projection onto $G$. □
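
The two cases of this proof can be tallied term by term, using the
decomposition of the complexity as n - |C| + L - O quoted in the caption of
Figure 10. The block below is only a bookkeeping sketch: it assumes that, in
the second case, the term not mentioned by the proof ($\delta O$ in the first
sub-case, $\delta n$ in the second) is zero.

```latex
\begin{align*}
\delta C &= \delta n - \delta|C| + \delta L - \delta O \\[4pt]
\delta|C| = -1:\quad
\delta C &= 0 - (-1) + 0 - 1 = 0 \\[4pt]
\delta|C| = 0,\ \delta n = -1:\quad
\delta C &= -1 - 0 + \delta L - 0 \le -1 + 1 = 0 \\[4pt]
\delta|C| = 0,\ \delta O = 1:\quad
\delta C &= 0 - 0 + \delta L - 1 \le 1 - 1 = 0
\end{align*}
```

In every case a single vertex merge leaves the complexity unchanged or lowers
it, which is the statement used in the iteration.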

Lemma 6. For any flow sequence $S$:

$$C(S) \le C_f(S)$$

Proof. We already demonstrated that $C_f(S)$ is at least the sum of the
complexities of its flow changes computed on their respective phased sequence
graphs. We also demonstrated that for any pair of flows on a phased sequence
graph, their complexity is at least the complexity of their projections
on $G$. □

Figure 10: Projections of a DCJ flow onto $G$. The vertices are labelled
according to their projected image on $G$. Beside each transformation the
complexity is computed, decomposed as $n - |C| + L - O$.



Definition 43. For a given flow sequence $S$, its recombination cost is defined
as the non-negative number:

$$R_c(S) = C_f(S) - C(S)$$

4.4 Rooted Metagenome Transitions

Definition 44. A rooted metagenome transition is a 1-step evolutionary
metagenome sequence of weighted metagenomes
$S = ((w^{(1)}, m_1), (w^{(2)}, m_2))$, where $m_1$ has a single clone $g$ that
necessarily has weight 1. We refer to $g$ as the root genome of $S$.

5 Computational Complexity 

Supplemental Definition 2. A CNV flow change is a valid flow change
$(f_1, f_2)$ such that $f_2 - f_1$ is the alternating flow of a thread.

Supplemental Definition 3. A rearrangement flow change is a valid flow
change $(f_1, f_2)$ such that $f_2 - f_1$ is non-zero on bond edges only.

Supplemental Lemma 13. The complexity of a breakpoint flow change is 
equal to the number of bond deletions minus 1. 

Proof. Because $(f_1, f_2)$ is a valid flow change, the complexity term $O$ is
equal to 0.

For a DCJ flow change to create a non-alternating cycle of created and
deleted bonds, it would be necessary to have two bond deletions incident
upon the same vertex or an ancestral bond deleted twice. Since both are
impossible, $L = 0$. Hence the complexity is equal to $n - |C|$. Let $n_1$ be
the number of destroyed bonds, and $n_2 = n - n_1$. The components of $C$
are similarly divided into two sets: the singletons of ancestral bonds which
are not destroyed, and the single alternating cycle of created and destroyed
bonds. Hence $|C| = n_2 + 1$ and the complexity is equal to $n_1 - 1$. □
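
Spelled out, the count in this proof reads as follows. The numerical instance
($n = 5$ with $n_1 = 2$ destroyed bonds) is a hypothetical example chosen only
to illustrate the arithmetic.

```latex
\begin{align*}
C(f_1, f_2) &= n - |C| + L - O = n - |C| && (L = 0,\ O = 0) \\
            &= (n_1 + n_2) - (n_2 + 1) = n_1 - 1 \\[4pt]
n = 5,\ n_1 = 2:\quad
n_2 &= 3, \qquad |C| = n_2 + 1 = 4, \qquad C(f_1, f_2) = 5 - 4 = 1 = n_1 - 1
\end{align*}
```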

Theorem 6. It is NP-hard to compute $C_g(S)$ for a rooted metagenome
transition $S$.

Proof. We prove NP-hardness by reduction from the similar MAX-ACD
problem [Caprara, 1999].

Given a bi-colored (red and yellow) Eulerian graph $G_b$, we create a
sequence graph $G_s$. For each colored edge in $G_b$ we create a segment edge
in $G_s$, and for each vertex in $G_b$ a segment edge with arbitrary left and
right sides in $G_s$. Original threads are created in $G_s$ by alternating
between the segment corresponding to a yellow edge, the segment corresponding
to the end of that edge (left to right), a random segment edge which still has
uncovered yellow edges (right to back), a yellow segment, etc. The segment
edges corresponding to the red edges are circularized. A derived genome is
created similarly, but with opposite colors. The only constraint on the
second genome is that the backside bonds be identical in multiplicity to those
in the original genome. See Figure 11 for an illustration. From these two
genomes, we obtain a rooted metagenome transition on $G_s$.

For a maximum cardinality alternating Eulerian cycle decomposition $M$ of
$G_b$, solution of the MAX-ACD problem, we construct a solution $S$ to the
rooted metagenome transition problem constructed above on $G_s$. Each cycle of
the decomposition is transformed into an alternating flow: yellow edges are
assigned -1, red edges +1. An evolutionary metagenome sequence is created
from these alternating flows. For each alternating flow in random order,
a clone creation with that flow change is created, then its sibling clone is
deleted. Because no segment is duplicated or deleted, the sequence graph
remains unchanged throughout this evolutionary sequence. The evolutionary
metagenome sequence $S$ only contains rearrangement flow changes, therefore
if $N_y$ denotes the number of yellow edges in $G_b$:

$$C_g(S) = 3N_y - |M|$$

We now show that this solution is an optimal realization of the metagenome
transition. Let us choose an optimal realization $P$ of the metagenome
transition. As shown earlier, each clone derivation branch in $P$ can be
extended into an evolutionary metagenome sequence $P'$ composed of only
breakpoint or CNV flow changes. Finally, we compute the flow sequence $P''$
of $P'$ on $G_s$. As seen earlier:

$$C_g(S) \ge C_g(P) = C_g(P') = C(P') \ge C(P'')$$

The complexity $C(P'')$ is at least the total complexity of its rearrangement
flow changes, which is equal to the number of edge deletions contained in
them minus the number of these rearrangement flow changes. Let $N_S$ and
$N_{P''}$ be the number of rearrangement flow changes in $S$ and $P''$
respectively. Let $d_S$ and $d_{P''}$ be the number of bond deletions in the
rearrangement flow changes of $S$ and $P''$ respectively. Because the cycle
decomposition governing the creation of $S$ is maximal and the rearrangement
cycles of $P''$ possibly cover more bond edges than those of $S$,
$N_{P''} - N_S \le d_{P''} - d_S$. Hence:

$$C_g(S) \ge C_g(P) \ge C(P'') \ge d_{P''} - N_{P''} \ge d_S - N_S = C_g(S)$$

$$C_g(S) = C_g(P)$$

$$|M| = 3N_y - C_g(P)$$

The cardinality of the solution to the MAX-ACD problem can therefore be
computed using the cost of the optimal realization of a rooted metagenome
transition. □




Figure 11: Reducing a MAX-ACD problem to a rearrangement problem
with duplications and deletions.

5.1 Heuristic sampling 

5.2 Ergodic tour of the parameter space 

Definition 45. The traversal $t$ is tight if $e_i = e_j$ implies that $(i - j)$
is even. See Figure 1 for an example.

Definition 46. A near-simple cycle traversal is an even-length cycle traversal
which is tight and synchronized. Its derived flow is a near-simple flow. A
near-simple flow sequence is composed of near-simple flows.
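
The tightness condition of Definition 45 can be checked mechanically. The
sketch below assumes a traversal is given as a sequence of hashable edge
identifiers (a representation chosen here for illustration); the function name
`is_tight` is hypothetical. It does not test the even-length or synchronized
conditions required of a near-simple cycle traversal.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def is_tight(traversal: Sequence[Hashable]) -> bool:
    """Return True if every repeated edge recurs at positions of equal parity.

    Two positions i and j have an even difference exactly when they share
    the same parity, so it suffices to record the parities seen per edge.
    """
    parities = defaultdict(set)
    for index, edge in enumerate(traversal):
        parities[edge].add(index % 2)
    return all(len(seen) == 1 for seen in parities.values())
```

For instance, `is_tight(["a", "b", "c", "b"])` holds because edge `b` recurs
two positions apart, whereas `is_tight(["a", "b", "b", "c"])` does not.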

Lemma 7. Under the above heuristic assumptions, the set of near-simple 
flow sequences with linearly independent flows which connect an original 
genome to a given metagenome flow is finite. 

Proof. Given a simple flow, none of its tight traversals can visit a given
edge more than twice, so the total number of simple flows of the graph is
bounded. Any acceptable solution should be a set of linearly independent
flows, so their total number is bounded. It is then necessary to build a
descent tree, such that at least half of the branches are assigned at least one
flow. The number of possible trees is thus bounded. Finally, the number of
possible phased flow change constructions is simply a finite combination of
finite phasing decisions, producing a finite space. □

Definition 47. If there exist near-simple cycle traversals
$t_1 = \{e_1, e_2, \ldots, e_{2n}\}$ and
$t_2 = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_{2\nu}\}$ such that they start
at the same vertex and overlap over their last $p > 0$ edges, but
$e_1 \neq \epsilon_1$, we produce a third traversal $t_3$, such that:

$$t_3 = \{e_1, \ldots, \epsilon_{2\nu-p-1}\}$$

$t_3$ can always be broken up into a set of near-simple cycle traversals $T$.
This transformation of $\{t_1, t_2\}$ into $\{t_1\} \cup T$ is called a merge.


Figure 12: Two overlapping flows transformed by a merge.

Definition 48. Given two tight cycle traversals $\{e_i\}_{i=1..2n}$ and
$\{\epsilon_i\}_{i=1..2\nu}$ which overlap over their first $l$ edges, such that
$l$ is odd, we can construct three new cycle traversals:

$$t_1 = \{E, e_{l+1}, \ldots, e_{2n}\}$$
$$t_2 = \{E, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$$
$$t_3 = \{e_1, \ldots, e_l, E\}$$

where $E$ is a bond edge which connects the destination vertex of $e_l$ to the
starting vertex of $e_1$.

Alternately, we can build the following three cycle traversals:

$$t_1 = \{E_1, E_2, E_3, e_{l+1}, \ldots, e_{2n}\}$$
$$t_2 = \{E_1, E_2, E_3, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$$
$$t_3 = \{e_1, \ldots, e_l, E_3, E_2, E_1\}$$

where $E_1$ is a bond edge which connects the destination vertex of $e_l$ to
$\omega$, $E_2$ is a bond edge which connects $\omega$ to $\omega$, and $E_3$ is a
bond edge which connects $\omega$ to the starting vertex of $e_1$.

If $l$ is even and greater than zero, then we construct the following cycle
traversals:

$$t_1 = \{E_1, E_2, e_{l+1}, \ldots, e_{2n}\}$$
$$t_2 = \{E_1, E_2, \epsilon_{l+1}, \ldots, \epsilon_{2\nu}\}$$
$$t_3 = \{e_1, \ldots, e_l, E_2, E_1\}$$

where $E_1$ is a bond edge which connects the destination vertex of $e_l$ to
$\omega$, and $E_2$ is a bond edge which connects $\omega$ to the starting vertex
of $e_1$. This transformation is referred to as a split.

Figure 13: Two overlapping flows transformed by a split.

Theorem 7. The process of applying random merges and splits is ergodic
over all the independent near-simple flow sets which span a given net flow
change.



Proof. We are given a set of independent near-simple flows $H$ which span a
flow $f \in \mathcal{F}(G)$, and wish to reach a target set of independent
near-simple flows $H'$ which also span $f$.

We first reduce the scope of the problem by restricting it to edges where
$f$ is not zero. For every edge where $f$ is 0, yet flows of $H$ have non-zero
values, we progressively split these flows, until a single flow traverses that
edge. When decomposing $f$ into vectors of $H$, that last vector necessarily
has a weighting of 0, hence it can be removed from the set. The splits can
be reversed by merges. We do the same with the flows of $H'$. After this
transformation, $H'$ and $H$ cover exactly the same edges.

We choose a flow $f'$ of $H'$, and one of its tight traversals $t'$. Since $H$
and $H'$ cover the same set of edges, it is possible to define a sequence of
flows of $H$ which cover each of the edges of $t'$. The flows are sequentially
merged, since they share an overlap of at least one vertex with the next flow.
It is thus possible to reconstruct $t'$. We then re-iterate by reconstructing a
flow of $H' \setminus \{f'\}$ from the flows of $H'' \setminus \{f'\}$, where
$H''$ denotes the set obtained from $H$ by these merges. When that iteration is
done, we have reconstructed all the flows of $H'$ using merges and splits on
the flows of $H$. □
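
The split of Definition 48 is a purely combinatorial rewrite of two edge lists,
which the following sketch makes concrete. It assumes traversals are lists of
edge labels and that the caller supplies the new bond edges in order (E for
odd overlap, or E1, E2, E3 for the alternate odd construction, or E1, E2 for
even overlap). The function name `split` and this calling convention are
hypothetical, and the sketch does not check that the supplied bonds connect
the required vertices.

```python
from typing import List, Sequence, Tuple

Traversal = List[str]


def split(t1: Sequence[str], t2: Sequence[str], l: int,
          bonds: Sequence[str]) -> Tuple[Traversal, Traversal, Traversal]:
    """Split two traversals that share their first l edges.

    Returns (t1', t2', t3): the two remainders prefixed by the new bonds,
    and the shared prefix closed by the same bonds in reverse order.
    """
    if l <= 0 or list(t1[:l]) != list(t2[:l]):
        raise ValueError("traversals must share their first l > 0 edges")
    # Odd overlap takes 1 or 3 bonds, even overlap takes 2 bonds.
    allowed = (1, 3) if l % 2 == 1 else (2,)
    if len(bonds) not in allowed:
        raise ValueError("wrong number of bond edges for this overlap parity")
    new_bonds = list(bonds)
    new_t1 = new_bonds + list(t1[l:])
    new_t2 = new_bonds + list(t2[l:])
    new_t3 = list(t1[:l]) + new_bonds[::-1]
    return new_t1, new_t2, new_t3
```

With these bond counts, the parity of `l` plus the number of bonds is always
even, so even-length inputs yield three even-length traversals.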

6 Simple flows and cactus graphs 

Definition 49. A cactus graph is a graph in which any two simple cycles
intersect in at most one vertex, which may be called an intersection point
[Harary and Uhlenbeck, 1952]. A cutpoint in a graph is a vertex that, if
removed, splits the connected component in which it resides into two or more
connected components called subcomponents.

Supplemental Lemma 14. Given a simple cycle traversal $t = \{e_i\}_{i=1..l}$
(i.e. without vertex re-use) and a set of properly nested pairs of indices in
$[1, l]^2$, merging the vertices at the end of the edges indicated by the index
pairs creates a cactus graph.

Proof. We will demonstrate that the graph is 2-edge connected, i.e. for any
two distinct vertices $A$ and $B$ in the graph, 2 edge removals are sufficient
to disconnect the nodes. Both $A$ and $B$ can be traced to the merging of two
sets of vertices in the original cycle. Let $I(A)$ (resp. $I(B)$) be the set of
indices of original vertices merged into $A$ (resp. $B$). The two sets cannot
overlap, else they would be identical and $A$ and $B$ would not be distinct.
Without loss of generality we assume that $\min(I(A)) < \min(I(B))$. Because of
the nesting constraint, no vertex in $[\min(I(B)), \max(I(B))]$ can be merged to
a vertex outside, therefore cutting the edges at indices $\min(I(B))$ and
$\max(I(B)) + 1$ breaks the graph into two components. Because all the indices
of $I(B)$ belong to the interval, whereas all the indices of $I(A)$ are outside
of this interval, $A$ is disconnected from $B$. Thus, the final graph is 2-edge
connected, in other words it is a cactus graph. □

See Figure 14 for an example. 
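
The construction of Supplemental Lemma 14 can be sketched in a few lines. The
snippet assumes that vertex i (for i from 1 to l) is the end of edge e_i, so
that e_i joins vertex i-1 to vertex i and e_1 starts at vertex l, and it
reads "properly nested" as "no two index pairs cross". Both conventions, and
the function names, are assumptions made for illustration. The code builds
the merged graph but does not itself verify the cactus property.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def is_properly_nested(pairs: Sequence[Pair]) -> bool:
    """Return True if no two index pairs cross, i.e. never a < c < b < d."""
    normalized = [tuple(sorted(pair)) for pair in pairs]
    for a, b in normalized:
        for c, d in normalized:
            if a < c < b < d:
                return False
    return True


def merge_cycle(length: int, pairs: Sequence[Pair]) -> List[Pair]:
    """Merge paired vertices of a simple cycle and list the resulting edges."""
    parent = list(range(length + 1))  # index 0 is unused

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(j)] = find(i)  # vertex j joins the class of vertex i

    edges = []
    for i in range(1, length + 1):
        start = length if i == 1 else i - 1
        edges.append((find(start), find(i)))
    return edges
```

For a cycle of four edges, `merge_cycle(4, [(2, 4)])` returns
`[(2, 1), (1, 2), (2, 3), (3, 2)]`: two cycles of length two that meet only at
the merged vertex.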

Definition 50. In a cutpoint split a cutpoint is not removed but rather
replaced by a set of vertices, one vertex for each subcomponent that would
have been created by its removal. An even-odd cactus graph is a cactus
graph in which each connected component has an even number of edges, and
each cutpoint splits its component into two subcomponents, each with an
odd number of edges (see Figure 15(b) for an example).

Definition 51. For any integer flow $f$ on $G$, its unitized flow $G(f)$ is the
graph $(V_f, E_f)$ where $V_f$ is the set of the vertices of $G$ that have an edge
with nonzero flow incident upon them and $E_f$ is such that for any pair of
nodes $(x, y)$ connected in $G$ by an edge with nonzero flow $n$, there are $|n|$
edges between $x$ and $y$ in $G(f)$, each with flow +1 if $n$ is positive and each
with flow -1 if $n$ is negative.
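
Definition 51 translates directly into code. The sketch below assumes the flow
is stored as a dictionary from vertex pairs to integers, which simplifies the
sequence graph to at most one edge per vertex pair. The function name
`unitize` and this data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
UnitEdge = Tuple[str, str, int]


def unitize(flow: Dict[Edge, int]) -> Tuple[Set[str], List[UnitEdge]]:
    """Build the unitized flow graph (V_f, E_f) of an integer flow.

    An edge carrying flow n becomes |n| parallel edges carrying the sign
    of n; edges with zero flow and their isolated vertices are dropped.
    """
    vertices: Set[str] = set()
    unit_edges: List[UnitEdge] = []
    for (x, y), n in flow.items():
        if n == 0:
            continue
        vertices.update((x, y))
        sign = 1 if n > 0 else -1
        unit_edges.extend([(x, y, sign)] * abs(n))
    return vertices, unit_edges
```

For example, a flow of 2 on edge (a, b), -1 on (b, c) and 0 on (c, d) yields
the vertex set {a, b, c} with two +1 edges between a and b and one -1 edge
between b and c.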

Lemma 8. A flow $f$ on $G$ is simple if and only if the graph of its unitized
flow $G(f)$ is a single-component even-odd cactus graph.

Figure 14: Transforming a traversal with synchronized and nested coinciding
pairs (shown with dotted lines) into a cactus graph. Note how the odd
distance between coinciding indices determines the color pattern at each
simple cycle intersection, such that an alternating cycle must necessarily go
from one cycle to its neighbour.




Figure 15: (a) A cactus graph, (b) an even-odd cactus graph.






Proof. Let $f$ be a flow on $G$, and $t$ one of its tight traversals. Let $G'$ be
the subgraph of edges with non-zero weight. $G'$ can be constructed by taking a
simple cycle (i.e. that does not visit any edge twice) of the same length as
$t$. Given a traversal of this cycle, the nodes are assigned sequential indices.
We then merge the nodes whose indices coincide in $t$. As shown above this
creates a cactus graph. □